What is a Sitemap?

The Sitemap allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site.

A sitemap is a file in which you give information about the pages, videos, and other files on your website and in which you indicate the relationships between these files. Search engines such as Google read this file to explore your site more intelligently. A sitemap tells Google which files you think are essential on your website and also provides valuable information about those files.

For example, it shows when a page was last updated, how often it changes, and which versions exist in other languages.

If the pages of your site are correctly linked, crawlers can usually discover most of your website. Nevertheless, sending a sitemap to search engines helps improve the crawling of large, very complicated, or very specialized websites.

Sitemap

Most websites, such as Primates.dev, have their sitemap located at /sitemap.xml.

Let's look at our Sitemap. Go to https://primates.dev/sitemap.xml.
You should end up on a page that looks like the image below.

[Image: Sitemap Primates.dev]

You can see that our sitemap.xml lists four sitemaps. Websites usually split their content across several sitemaps.

  • sitemap-pages.xml references all the pages of our website that are not posts.
  • sitemap-posts.xml references all the articles on our website. This is useful if you want to parse a website's articles.
  • sitemap-authors.xml references all the authors of our website. This sitemap is specific to our website because we have multiple authors.
  • sitemap-tags.xml references all the tags of our website.

Sitemaps are present on almost every website. A sitemap is a file that allows search engines to easily find new pages on a website without having to crawl the whole site. What if you can't find your sitemap? You can look at your robots.txt file.
Go to <yourwebsite>/robots.txt.

As an example, we'll take a look at https://primates.dev/robots.txt.

[Image: Robots.txt Primates]

  • Sitemap: References the sitemaps of a website. Not every robots.txt lists them, but most websites declare their sitemap in this file.
  • Disallow: Paths we don't want to be crawled or referenced in search engines. For example, /ghost/ is the address we use to reach our admin.
  • User-agent: Names the crawler that has to follow the rules listed under it. Here, the rules apply to every crawler.
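
The directives above are plain "Name: value" lines, so the Sitemap entries are easy to extract with a few lines of code. Below is a minimal Python sketch, assuming you already have the robots.txt content as a string. The function name sitemaps_from_robots and the sample content are illustrative; the sample is modeled on the directives described above and is not a copy of the real Primates.dev file.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs declared in 'Sitemap:' lines of a robots.txt text."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL's own colons survive.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemaps.append(value.strip())
    return sitemaps


# Hypothetical robots.txt content, for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Sitemap: https://primates.dev/sitemap.xml
Disallow: /ghost/
"""

print(sitemaps_from_robots(SAMPLE_ROBOTS))  # ['https://primates.dev/sitemap.xml']
```

The field name is compared case-insensitively because robots.txt files in the wild write it as "Sitemap", "sitemap", or "SITEMAP".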

No luck with the robots.txt file? Then the two methods below should help.

Easily find a website's Sitemap

Method 1:

Here is a list of the most common sitemap URI paths we have come across. Please note that it can change as the technologies used to build websites go in and out of fashion. However, most websites use the same few paths for their sitemaps, so a sitemap is usually easy to find, whether through the robots.txt or by just trying these paths.

  • /sitemap.xml
  • /feeds/posts/default?orderby=updated
  • /sitemap.xml.gz
  • /sitemap_index.xml
  • /s2/sitemaps/profiles-sitemap.xml
  • /sitemap.php
  • /sitemap_index.xml.gz
  • /vb/sitemap_index.xml.gz
  • /sitemapindex.xml
  • /sitemap.gz
  • /sitemap_news.xml
  • /sitemap-index.xml
  • /sitemap-news.xml
  • /post-sitemap.xml
  • /page-sitemap.xml
  • /portfolio-sitemap.xml
  • /home_slider-sitemap.xml
  • /category-sitemap.xml
  • /author-sitemap.xml

With this extensive list of common sitemap URI paths, you should be able to find what you are looking for. If the sitemap is not at any of these paths, method 2 should give you the answer.
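
Trying the paths by hand gets tedious, so here is a minimal Python sketch of how method 1 could be automated. It is one possible approach, not a definitive tool: the names COMMON_SITEMAP_PATHS, candidate_urls, url_exists, and find_sitemaps are made up for this example, the path list is only a subset of the list above, and the sketch assumes the server answers HEAD requests (some servers reject them).

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

# A subset of the common paths listed above.
COMMON_SITEMAP_PATHS = [
    "/sitemap.xml",
    "/sitemap.xml.gz",
    "/sitemap_index.xml",
    "/sitemap-index.xml",
    "/sitemapindex.xml",
    "/sitemap.php",
    "/post-sitemap.xml",
    "/page-sitemap.xml",
]


def candidate_urls(base_url, paths=COMMON_SITEMAP_PATHS):
    """Build the full URLs to try for a given website."""
    return [urljoin(base_url, path) for path in paths]


def url_exists(url, timeout=10):
    """Return True if a HEAD request to the URL answers with HTTP 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404, DNS failures, and timeouts.
        return False


def find_sitemaps(base_url, exists=url_exists):
    """Return the candidate URLs for which the exists() check succeeds."""
    return [url for url in candidate_urls(base_url) if exists(url)]


# Example (performs real network requests):
# print(find_sitemaps("https://primates.dev"))
```

The exists parameter is there so that the check can be swapped out, for example for a GET request if a server refuses HEAD.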

Method 2:

Let's use Google to find the remaining Sitemaps. This method usually only works for big websites such as news websites.
Go to Google.com and type site:<url_website> filetype:xml
For example: site:theguardian.com filetype:xml

With any luck, you should find a sitemap.

In our case, the search turned up a sitemap that isn't on our list. It is a simple and effective way to find sitemaps for huge websites, but it does not work for every website.

How to read sitemaps

Let's take a look at the source code of a sitemap. Go to https://primates.dev/sitemap.xml and look at the source code.


<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//primates.dev/sitemap.xsl"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://primates.dev/sitemap-pages.xml</loc><lastmod>2020-02-29T13:17:58.206Z</lastmod></sitemap>
<sitemap><loc>https://primates.dev/sitemap-posts.xml</loc><lastmod>2020-03-02T11:28:46.333Z</lastmod></sitemap>
<sitemap><loc>https://primates.dev/sitemap-authors.xml</loc><lastmod>2020-03-02T17:24:31.387Z</lastmod></sitemap>
<sitemap><loc>https://primates.dev/sitemap-tags.xml</loc><lastmod>2020-03-01T00:39:53.448Z</lastmod></sitemap>
</sitemapindex>
Sitemap Index

This is what we call a sitemap index. It is a list of other sitemaps. As mentioned above, it references all the sitemaps of our website.

  • loc: URL of the sitemap
  • lastmod: Last time the sitemap was modified

This sitemap index file is essential for crawlers. It is the entry point for Google's crawlers: from this file, a crawler learns where all the other sitemaps are and whether they have changed since the last crawl.
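
To show how a crawler might read such an index, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The function name parse_sitemap_index is our own choice for this example, and the sample is a shortened copy of the index shown above. Note that every element lives in the sitemaps.org namespace, so the lookups have to use it.

```python
import xml.etree.ElementTree as ET

# Namespace declared on the <sitemapindex> root element.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_index(xml_data):
    """Return a list of (loc, lastmod) tuples from a sitemap index."""
    root = ET.fromstring(xml_data)
    entries = []
    for sitemap in root.findall("sm:sitemap", SITEMAP_NS):
        loc = sitemap.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        # lastmod is optional, so it may be None.
        lastmod = sitemap.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod))
    return entries


# Shortened copy of the sitemap index shown above.
SAMPLE_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://primates.dev/sitemap-pages.xml</loc><lastmod>2020-02-29T13:17:58.206Z</lastmod></sitemap>
<sitemap><loc>https://primates.dev/sitemap-posts.xml</loc><lastmod>2020-03-02T11:28:46.333Z</lastmod></sitemap>
</sitemapindex>"""

for loc, lastmod in parse_sitemap_index(SAMPLE_INDEX):
    print(lastmod, loc)
```

A crawler can compare each lastmod value with the time of its previous visit and fetch only the sitemaps that changed.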

Let's take a look at https://primates.dev/sitemap-posts.xml. As before, look at the source code.

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//primates.dev/sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
    <loc>https://primates.dev/brave-says-no-to-error-404/</loc>
    <lastmod>2020-03-02T11:28:46.280Z</lastmod>
    <image:image>
        <image:loc>https://primates.dev/content/images/2020/02/404-brave-not-found-primates-dev.jpg</image:loc>
        <image:caption>404-brave-not-found-primates-dev.jpg</image:caption>
    </image:image>
</url>
<url>
    <loc>https://primates.dev/create-games-directly-in-your-browser-babylonjs/</loc>
    <lastmod>2020-02-29T23:49:11.000Z</lastmod>
    <image:image>
        <image:loc>https://primates.dev/content/images/2020/02/gaming-babylonjs-primates-dev.jpg</image:loc>
        <image:caption>gaming-babylonjs-primates-dev.jpg</image:caption>
    </image:image>
</url>
</urlset>

This is only an extract of the sitemap. Every URL in this sitemap is described with the same fields.

  • urlset: Indicates that it is a list of URLs
  • url: Indicates a URL
  • loc: URL of the page
  • lastmod: Last time the page was changed
  • image:image: Indicates that it is an image
  • image:loc: URL of the featured image of the article
  • image:caption: Caption of the image
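
The fields above can be read in code the same way as the sitemap index. The following Python sketch is one way to do it with xml.etree.ElementTree; the function name parse_urlset and the shape of the returned dictionaries are choices made for this example. The sample is the first entry of the extract above. The image elements use a second namespace, so both namespaces are registered.

```python
import xml.etree.ElementTree as ET

# Namespaces declared on the <urlset> root element.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}


def parse_urlset(xml_data):
    """Return one dict per <url> entry with its loc, lastmod, and image URLs."""
    root = ET.fromstring(xml_data)
    pages = []
    for url in root.findall("sm:url", NS):
        pages.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            # A page can reference zero or more images.
            "images": [
                image.findtext("image:loc", namespaces=NS)
                for image in url.findall("image:image", NS)
            ],
        })
    return pages


# First entry of the extract shown above.
SAMPLE_URLSET = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
<url>
    <loc>https://primates.dev/brave-says-no-to-error-404/</loc>
    <lastmod>2020-03-02T11:28:46.280Z</lastmod>
    <image:image>
        <image:loc>https://primates.dev/content/images/2020/02/404-brave-not-found-primates-dev.jpg</image:loc>
        <image:caption>404-brave-not-found-primates-dev.jpg</image:caption>
    </image:image>
</url>
</urlset>"""

for page in parse_urlset(SAMPLE_URLSET):
    print(page["lastmod"], page["loc"], page["images"])
```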

How to send your sitemap to Google?

Once you have found or created your sitemap, it is time to send it to Google. It won't boost your position in Google's results, but it will help Google index your website. You'll also gain more insight into the pages of your website that Google has indexed.

Go to Google Search Console

Create an account. Verify your website.

Then go to Index -> Sitemaps and enter the URL of your sitemap.

[Image: Google Search Console Sitemap Primates.dev]

You are all set! Google will now parse your sitemap periodically.