Why Your sitemap.xml Is Probably Wrong

Empirium Team · 8 min read

Your sitemap.xml is a direct communication channel with Google's crawler. It tells Googlebot which pages exist, when they were last updated, and how they relate to alternate language versions. Most sites waste this channel by including pages Google shouldn't index, omitting pages it should, and lying about update dates.

A 2024 study by Screaming Frog found that 68% of sitemaps contain at least one critical error. The most common are non-canonical URLs, stale lastmod dates that don't reflect actual changes, and missing hreflang annotations on multi-language sites.

Getting your sitemap right doesn't improve rankings directly — but it dramatically affects how quickly Google discovers and indexes your content. For sites with more than 10,000 pages, it directly impacts crawl budget efficiency.

What a Sitemap Should (and Should Not) Contain

A sitemap should contain exactly one entry for every URL that meets ALL of these criteria:

  • Returns HTTP 200 (not 301, 404, or 500)
  • Has index, follow directives (not noindexed)
  • Is the canonical version (self-referencing canonical, not pointing elsewhere)
  • Contains substantive content (not thin, auto-generated, or parameter variations)

Everything else is noise that dilutes your sitemap's signal.
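The four criteria above can be sketched as a single filter. This is a minimal illustration, not a definitive implementation: the PageRecord shape, its field names, and the MIN_WORDS threshold are all assumptions, and you would populate them from your own crawl or CMS data.

```typescript
// Hypothetical shape of a crawled page record; field names are assumptions.
interface PageRecord {
  url: string;
  status: number;       // HTTP status returned for the URL
  noindex: boolean;     // true if a robots meta tag or header sets noindex
  canonicalUrl: string; // absolute URL taken from rel="canonical"
  wordCount: number;    // rough proxy for "substantive content"
}

// Assumed threshold for thin content; tune it for your own site.
const MIN_WORDS = 150;

// A URL belongs in the sitemap only if it passes every check.
function isSitemapEligible(page: PageRecord): boolean {
  return (
    page.status === 200 &&
    !page.noindex &&
    page.canonicalUrl === page.url &&
    page.wordCount >= MIN_WORDS
  );
}

const crawledPages: PageRecord[] = [
  { url: 'https://example.com/guide', status: 200, noindex: false, canonicalUrl: 'https://example.com/guide', wordCount: 1200 },
  { url: 'https://example.com/old-slug', status: 301, noindex: false, canonicalUrl: 'https://example.com/old-slug', wordCount: 0 },
  { url: 'https://example.com/login', status: 200, noindex: true, canonicalUrl: 'https://example.com/login', wordCount: 300 },
  { url: 'https://example.com/guide?sort=asc', status: 200, noindex: false, canonicalUrl: 'https://example.com/guide', wordCount: 1200 },
];

const eligibleUrls = crawledPages.filter(isSitemapEligible).map(p => p.url);
console.log(eligibleUrls);
```

Only the first record survives: the others fail on status, noindex, and canonical respectively.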

What to Exclude

URL Type | Why Exclude | Common Mistake
Redirect URLs (301/302) | Google follows the redirect, sitemap entry is ignored | CMS includes old slugs after URL changes
Noindexed pages | Contradicts the noindex directive | Login, admin, or staging pages left in sitemap
Non-canonical URLs | Google uses the canonical, ignores duplicates | Parameter URLs (?sort=, ?page=) included
Paginated archives (/page/2/, /page/3/) | Usually thin, duplicate content | WordPress auto-includes pagination
Search result pages (/search?q=) | Infinite URL space, thin content | Internal site search URLs crawled
Tag/category pages (if thin) | Low content value, cannibalize pillar content | Auto-generated with 1-2 posts each

A Clean Sitemap Entry

<url>
  <loc>https://empirium.io/blog/core-web-vitals-optimization-2026</loc>
  <lastmod>2026-05-08</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
  <xhtml:link rel="alternate" hreflang="en" href="https://empirium.io/blog/core-web-vitals-optimization-2026"/>
  <xhtml:link rel="alternate" hreflang="fr" href="https://empirium.io/fr/blog/core-web-vitals-optimization-2026"/>
  <xhtml:link rel="alternate" hreflang="x-default" href="https://empirium.io/blog/core-web-vitals-optimization-2026"/>
</url>

Note: Google has publicly stated it ignores changefreq and priority. They're technically valid per the protocol but have zero impact on crawl behavior. Include them if your tooling generates them automatically, but don't invest time optimizing them.

The Five Most Common Mistakes

1. Stale lastmod Dates

The lastmod field should reflect when the page content actually changed — not when the server regenerated the page, not a hardcoded date, and definitely not today's date on every page.

Google uses lastmod to prioritize which pages to re-crawl. If every page says it was modified today, Google learns that your lastmod is unreliable and starts ignoring it entirely. Then when you actually update important content, Google has no signal to re-crawl it promptly.

Fix: Generate lastmod from your content's actual modification timestamp. In a CMS, use the content's updatedAt field. In a git-based workflow, use the file's last commit date.
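If neither a CMS timestamp nor a commit date is trustworthy, one complementary guard is to bump lastmod only when a hash of the content changes. The sketch below illustrates that idea under stated assumptions: the LastmodRecord shape is hypothetical, and the content hash is assumed to be computed elsewhere (for example, over the rendered main content).

```typescript
// Hypothetical record persisted between sitemap builds.
interface LastmodRecord {
  contentHash: string; // hash of the page's main content, computed elsewhere
  lastmod: string;     // date in YYYY-MM-DD form
}

// Keep the previous lastmod unless the content hash actually changed.
function resolveLastmod(
  previous: LastmodRecord | undefined,
  currentHash: string,
  now: Date,
): LastmodRecord {
  if (previous && previous.contentHash === currentHash) {
    return previous; // unchanged content: do not bump the date
  }
  // New or edited content: record the current date (UTC).
  return { contentHash: currentHash, lastmod: now.toISOString().slice(0, 10) };
}

const buildTime = new Date('2026-05-08T09:30:00Z');
const storedRecord: LastmodRecord = { contentHash: 'abc123', lastmod: '2026-03-01' };

const unchangedPage = resolveLastmod(storedRecord, 'abc123', buildTime);
const editedPage = resolveLastmod(storedRecord, 'def456', buildTime);
const newPage = resolveLastmod(undefined, 'aaa111', buildTime);

console.log(unchangedPage.lastmod, editedPage.lastmod, newPage.lastmod);
```

A rebuild that regenerates every page leaves unchanged pages at their old date, so lastmod stays a usable signal.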

2. Missing Hreflang Alternates

If your site has multiple language versions, every sitemap entry should include xhtml:link annotations for all language variants. This is the most scalable method for hreflang implementation and should be used alongside or instead of HTML <link> tags for large sites.

Fix: Generate hreflang entries programmatically. For every URL, include an xhtml:link for each locale where the content exists, plus x-default.
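A minimal sketch of that generation step follows. It assumes English is the default locale and is served without a path prefix (as in the sample entry earlier), and that paths contain no characters needing XML escaping. The function names are illustrative, not part of any library.

```typescript
// Assumed default locale, served without a path prefix.
const DEFAULT_LOCALE = 'en';

// Build the absolute URL for a page in a given locale.
function localizedUrl(baseUrl: string, locale: string, path: string): string {
  return locale === DEFAULT_LOCALE
    ? `${baseUrl}${path}`
    : `${baseUrl}/${locale}${path}`;
}

// Emit one xhtml:link per locale where the page exists, plus x-default.
function hreflangLinks(baseUrl: string, path: string, availableLocales: string[]): string[] {
  const links = availableLocales.map(
    locale =>
      `<xhtml:link rel="alternate" hreflang="${locale}" href="${localizedUrl(baseUrl, locale, path)}"/>`,
  );
  links.push(
    `<xhtml:link rel="alternate" hreflang="x-default" href="${localizedUrl(baseUrl, DEFAULT_LOCALE, path)}"/>`,
  );
  return links;
}

// The page exists in English and French only, so only those locales are listed.
const hreflangOutput = hreflangLinks(
  'https://empirium.io',
  '/blog/core-web-vitals-optimization-2026',
  ['en', 'fr'],
);
console.log(hreflangOutput.join('\n'));
```

Passing only the locales where a translation exists keeps the annotations from pointing at pages that would 404.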

3. Including Non-Canonical URLs

If page A has <link rel="canonical" href="/page-b">, page A should not be in your sitemap. Listing it tells Google "This URL matters" while the canonical tag says "Actually, the real URL is somewhere else." Mixed signals confuse crawlers.

Fix: Before adding a URL to the sitemap, verify it's self-canonical. Automate this check in your sitemap generation logic.
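One way to automate the check is to resolve the canonical href against the page URL and compare the results. This is a sketch under assumptions: the canonical href is taken to be already extracted from the page's HTML, and fragments are ignored when comparing. It relies only on the standard URL class.

```typescript
// Resolve a possibly relative href against a base URL and drop any fragment.
function normalizeUrl(href: string, base: string): string {
  const resolved = new URL(href, base);
  resolved.hash = '';
  return resolved.toString();
}

// True when the canonical tag points back at the page itself.
function isSelfCanonical(pageUrl: string, canonicalHref: string): boolean {
  return normalizeUrl(canonicalHref, pageUrl) === normalizeUrl(pageUrl, pageUrl);
}

// Page A canonicalizes to page B, so A stays out of the sitemap.
const pageAResult = isSelfCanonical('https://example.com/page-a', '/page-b');
// Page B canonicalizes to itself.
const pageBResult = isSelfCanonical('https://example.com/page-b', '/page-b');
// A parameter variation canonicalizes to the clean URL.
const sortedResult = isSelfCanonical('https://example.com/shoes?sort=price', 'https://example.com/shoes');

console.log(pageAResult, pageBResult, sortedResult);
```

Resolving relative hrefs first matters because canonical tags often use paths such as /page-b rather than absolute URLs.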

4. Exceeding File Size Limits

The sitemap protocol limits each sitemap file to 50,000 URLs and 50MB uncompressed. Many sites hit these limits without realizing it, resulting in truncated sitemaps where the newest (and often most important) pages are cut off.

Fix: Use sitemap index files when approaching limits. Split by content type or section:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://empirium.io/sitemap-pages.xml</loc>
    <lastmod>2026-05-08</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://empirium.io/sitemap-blog.xml</loc>
    <lastmod>2026-05-08</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://empirium.io/sitemap-services.xml</loc>
    <lastmod>2026-05-08</lastmod>
  </sitemap>
</sitemapindex>
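The split itself can be automated. The sketch below chunks a URL list at the 50,000-URL limit and builds a matching index. The numbered file naming (sitemap-blog-1.xml and so on) is an assumption, and the example checks only the URL count, not the 50MB size limit.

```typescript
// URL-count limit from the sitemap protocol.
const MAX_URLS_PER_SITEMAP = 50000;

// Split a URL list into chunks no larger than the limit.
function chunkUrls(urls: string[], size: number = MAX_URLS_PER_SITEMAP): string[][] {
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}

// Build the index XML. The numbered file names are a hypothetical convention.
function buildSitemapIndex(baseUrl: string, section: string, chunkCount: number, lastmod: string): string {
  const entries = Array.from({ length: chunkCount }, (_, i) =>
    [
      '  <sitemap>',
      `    <loc>${baseUrl}/sitemap-${section}-${i + 1}.xml</loc>`,
      `    <lastmod>${lastmod}</lastmod>`,
      '  </sitemap>',
    ].join('\n'),
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    '</sitemapindex>',
  ].join('\n');
}

// 120,000 synthetic blog URLs need three files: 50,000 + 50,000 + 20,000.
const blogUrls = Array.from({ length: 120000 }, (_, i) => `https://empirium.io/blog/post-${i}`);
const blogChunks = chunkUrls(blogUrls);
const blogIndexXml = buildSitemapIndex('https://empirium.io', 'blog', blogChunks.length, '2026-05-08');

console.log(blogChunks.map(c => c.length));
```

In a real build, each child sitemap's lastmod would come from the newest page it contains rather than a single shared date.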

5. Wrong Protocol (HTTP vs HTTPS)

If your site serves over HTTPS but your sitemap contains HTTP URLs, Google may treat them as different pages. This is especially common after SSL migrations where the sitemap template wasn't updated.

Fix: Ensure all sitemap URLs use the same protocol as your canonical URLs. If you serve HTTPS, every <loc> should start with https://.
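A quick check for this can run against the generated XML. The sketch below pulls out <loc> values with a regular expression, which is adequate for well-formed sitemaps but is not a full XML parser. The sample XML and function names are illustrative.

```typescript
// Extract <loc> values with a simple regex (not a full XML parser).
function extractLocs(xml: string): string[] {
  const locs: string[] = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Return every <loc> that does not use HTTPS.
function findNonHttpsLocs(xml: string): string[] {
  return extractLocs(xml).filter(loc => loc.indexOf('https://') !== 0);
}

const sampleSitemap = [
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
  '  <url><loc>https://empirium.io/</loc></url>',
  '  <url><loc>http://empirium.io/about</loc></url>',
  '</urlset>',
].join('\n');

const insecureLocs = findNonHttpsLocs(sampleSitemap);
console.log(insecureLocs);
```

Failing the build when this list is non-empty catches a stale template before the sitemap is published.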

Dynamic Sitemap Generation

Static sitemaps break the moment you add or remove content. Dynamic generation ensures your sitemap always reflects reality.

Next.js Implementation

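In the App Router, Next.js supports a sitemap.ts file convention: export a default function from app/sitemap.ts and the framework serves the result at /sitemap.xml: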

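import type { MetadataRoute } from 'next';
// Hypothetical helper and path: returns { path, updatedAt } for each page,
// loaded from a CMS or the filesystem.
import { getAllPages } from '@/lib/content';

const locales = ['en', 'fr', 'de', 'es', 'it', 'pt', 'nl', 'pl',
  'ru', 'zh', 'ja', 'ko', 'ar', 'hi', 'tr', 'sv', 'no', 'da', 'fi', 'cs'];
const baseUrl = 'https://empirium.io';

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const pages = await getAllPages();

  return pages.map(page => ({
    url: `${baseUrl}${page.path}`,
    // Real content timestamp, not the build time
    lastModified: page.updatedAt,
    alternates: {
      // Assumes every page exists in every locale; filter `locales`
      // per page if some translations are missing.
      languages: Object.fromEntries(
        locales.map(locale => [
          locale,
          // English is the default locale and has no path prefix
          locale === 'en'
            ? `${baseUrl}${page.path}`
            : `${baseUrl}/${locale}${page.path}`
        ])
      ),
    },
  }));
}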

This automatically generates a sitemap with correct URLs, real lastmod dates, and hreflang alternates for every locale. No manual maintenance required.

Validation After Generation

Always validate your generated sitemap:

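# Check URL count (counts occurrences, so it also works on single-line XML)
curl -s https://yoursite.com/sitemap.xml | grep -o '<loc>' | wc -l

# Print status code and URL for the first 50 entries (grep -P requires GNU grep)
curl -s https://yoursite.com/sitemap.xml | grep -oP '<loc>\K[^<]+' | \
  head -50 | xargs -I{} curl -s -o /dev/null -I -w '%{http_code} %{url_effective}\n' {}

# Validate XML structure against a local copy
curl -s https://yoursite.com/sitemap.xml -o sitemap.xml
xmllint --noout --schema http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd sitemap.xml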

Sitemap Index Files for Large Sites

Sites with more than 10,000 URLs should use sitemap index files. Benefits:

  • Logical organization: Separate sitemaps by content type (blog, services, products)
  • Partial updates: When blog content changes, only the blog sitemap needs regeneration
  • Faster processing: Google can process smaller files more efficiently
  • Better diagnostics: Search Console reports indexing status per sitemap

Splitting Strategy

Sitemap | Contents | Update Frequency
sitemap-pages.xml | Service pages, about, contact | Monthly
sitemap-blog.xml | Blog articles | Weekly
sitemap-locale-fr.xml | French locale pages | Weekly
sitemap-locale-de.xml | German locale pages | Weekly

For sites with 20 locales and 500 pages per locale, that's 10,000 URLs — enough to warrant splitting by locale. Each locale sitemap contains only the pages available in that language.
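Splitting by locale amounts to grouping URLs by their path prefix. The sketch below shows one way to do that. The list of prefixed locales and the sitemap-locale-en.xml name for unprefixed URLs are assumptions made for illustration.

```typescript
// Assumed non-default locales, matching the path prefixes used on the site.
const PREFIXED_LOCALES = ['fr', 'de'];

// Map a URL to the locale sitemap it belongs in.
// Unprefixed URLs go to a hypothetical sitemap-locale-en.xml.
function sitemapFileFor(url: string): string {
  const firstSegment = new URL(url).pathname.split('/')[1] || '';
  return PREFIXED_LOCALES.indexOf(firstSegment) !== -1
    ? `sitemap-locale-${firstSegment}.xml`
    : 'sitemap-locale-en.xml';
}

// Group URLs by target sitemap file.
function groupBySitemapFile(urls: string[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const url of urls) {
    const file = sitemapFileFor(url);
    if (!groups[file]) {
      groups[file] = [];
    }
    groups[file].push(url);
  }
  return groups;
}

const localeGroups = groupBySitemapFile([
  'https://empirium.io/blog/a',
  'https://empirium.io/fr/blog/a',
  'https://empirium.io/de/blog/a',
  'https://empirium.io/fr/contact',
  'https://empirium.io/free-tools',
]);
console.log(Object.keys(localeGroups));
```

The prefix match is exact, so a path like /free-tools is not mistaken for a French page.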

Submission

Submit your sitemap index file (not individual sitemaps) to Google Search Console and Bing Webmaster Tools. Include it in robots.txt:

Sitemap: https://empirium.io/sitemap-index.xml

Google will discover and process all child sitemaps from the index file.

FAQ

How often should I resubmit my sitemap?

You don't need to manually resubmit. Once Google knows about your sitemap (from robots.txt or Search Console), it re-fetches it periodically. The sitemap ping endpoint (https://www.google.com/ping) no longer works: Google deprecated it in 2023. Accurate lastmod dates are the best way to help Google notice changes. For urgent indexing of new pages, use Search Console's URL Inspection tool to request indexing directly.

Does Bing handle sitemaps differently than Google?

Bing relies more heavily on sitemaps than Google for content discovery. While Google discovers most pages through crawling links, Bing uses sitemaps as a primary discovery mechanism. Submit your sitemap to Bing Webmaster Tools separately, and use IndexNow for instant notification of new or updated URLs — Bing supports IndexNow natively, and it's also supported by Yandex.
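An IndexNow submission is a single JSON POST. The sketch below builds the payload and shows the request, based on the published IndexNow protocol as I understand it. The key value is a placeholder, the key file is assumed to be hosted at the site root, and nothing is sent unless submitToIndexNow is called.

```typescript
interface IndexNowPayload {
  host: string;
  key: string;
  keyLocation: string;
  urlList: string[];
}

// Build the request body. Assumes the key file is hosted at the site root.
function buildIndexNowPayload(host: string, key: string, urls: string[]): IndexNowPayload {
  return {
    host,
    key,
    keyLocation: `https://${host}/${key}.txt`,
    // Only URLs on the declared host are submitted.
    urlList: urls.filter(u => new URL(u).host === host),
  };
}

// Send the payload; returns the HTTP status code of the response.
async function submitToIndexNow(payload: IndexNowPayload): Promise<number> {
  const response = await fetch('https://api.indexnow.org/indexnow', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json; charset=utf-8' },
    body: JSON.stringify(payload),
  });
  return response.status;
}

// Placeholder key for illustration; generate your own and host the key file.
const indexNowPayload = buildIndexNowPayload('empirium.io', '0123456789abcdef', [
  'https://empirium.io/blog/new-post',
  'https://other.example/page',
]);
console.log(JSON.stringify(indexNowPayload, null, 2));
```

Calling submitToIndexNow(indexNowPayload) after a publish event notifies participating engines without waiting for the next sitemap fetch.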

Should I include images in my sitemap?

Image sitemaps (<image:image>) can help Google discover images that aren't easily found through regular crawling (e.g., JavaScript-loaded galleries). For most sites, standard sitemaps are sufficient because images are embedded in HTML that Google already crawls. If your site has significant image-based traffic (photography, e-commerce), add image sitemap entries for product and gallery pages.

What's the relationship between sitemap priority and crawl frequency?

None. Google ignores the <priority> field entirely. John Mueller has confirmed this publicly multiple times. The <lastmod> field is the only timing signal Google uses from sitemaps. Focus on accurate lastmod dates rather than priority values.

Can a bad sitemap hurt my SEO?

A bad sitemap won't directly hurt your rankings, but it wastes crawl budget and delays indexing. If your sitemap includes thousands of noindexed, redirected, or 404 URLs, Google spends crawl resources on pages that shouldn't be indexed — leaving less budget for your actual content. For large sites, this can significantly delay the indexing of new content. Our guide on crawl budget optimization covers this in detail.
