Have you ever wanted to extract all the URLs of a website quickly? We'll tell you how! Usually, crawlers visit every page of a website and index it. This method is extremely slow and often results in crawlers taking hours to crawl a website.
Most websites have a file called Sitemap that lists all the URLs. If we manage to find this file we can find all the URLs of a website in seconds. We'll look at a simple way to extract all the URLs of a website based on its Sitemap. It is hundreds of times faster than crawling all the pages of a website to find all of its URLs.
The Sitemap allows a webmaster to inform search engines about URLs on a website that are available for crawling. A Sitemap is an XML file that lists the URLs for a site. Most websites, such as Primates.dev have their Sitemap located at
Let's look at our Sitemap. Go to https://primates.dev/sitemap.xml.
You should end up on a page that looks like the image below.
You can see that we have four sitemaps in our sitemap.xml. Usually, websites split their content across several sitemaps.
- sitemap-pages.xml references all the pages of our website that are not posts.
- sitemap-posts.xml references all the articles on our website. Useful if you want to parse the content of a website's articles.
- sitemap-authors.xml references all the authors of our website. It is a specific sitemap to our website because we have multiple authors.
- sitemap-tags.xml references all the tags of our website.
Sitemaps are present on almost every website on the web. A sitemap is a file that allows search engines to easily find new pages on a website without having to crawl it. What if you can't find your Sitemap? You can look at your
As an example, we'll take a look at https://primates.dev/robots.txt.
- Sitemap: References the sitemaps of a website. You don't always have the sitemaps in the robots.txt, but most websites have their Sitemap in this file.
- Disallow: Pages we don't want to be crawled or referenced in search engines. For example, /ghost/ is the address we use to reach our admin.
- User-agent: Defines the rules that have to be followed by the specified user-agent. Here we determined that the following rules apply to everybody.
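As a minimal sketch of the robots.txt lookup described above, the helper below pulls every Sitemap line out of a robots.txt body. The function name `sitemaps_from_robots` and the sample text are assumptions made up for illustration, not the contents of the real primates.dev file.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the "https://" in the value stays intact
        field, _, value = line.strip().partition(":")
        # Field names are matched case-insensitively
        if field.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


# Hypothetical robots.txt body, made up for this example
sample = """User-agent: *
Sitemap: https://example.com/sitemap.xml
Disallow: /ghost/
"""
print(sitemaps_from_robots(sample))
```

If you would rather fetch and parse in one go, the standard library's `urllib.robotparser.RobotFileParser` also exposes a `site_maps()` method (Python 3.8 and later).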
No luck with the robots.txt file? Well then, you need some help.
Here is a list of the most common sitemap URI paths we have seen. Please note that it can change depending on which technologies are popular for building websites at the moment. However, most websites use the same URIs for their sitemaps, so they are usually easy to find, whether in robots.txt or just by trying.
With this extensive list of common sitemap URI paths, you should be able to find what you are looking for. If the sitemap is not in the previous list, then method 2 should give you the answer.
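Trying candidate paths by hand gets tedious, so here is a hedged sketch that turns a short list of paths into full URLs to probe. The three paths below are illustrative assumptions only (common conventions, not the list from this article), and `candidate_sitemap_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

# Illustrative guesses only; swap in whichever paths you want to try
CANDIDATE_PATHS = ["/sitemap.xml", "/sitemap_index.xml", "/sitemap-index.xml"]


def candidate_sitemap_urls(site, paths=CANDIDATE_PATHS):
    """Build the absolute URLs worth probing for a given site root."""
    return [urljoin(site, path) for path in paths]


for url in candidate_sitemap_urls("https://primates.dev"):
    print(url)
```

Each URL can then be requested in turn, keeping the ones that answer with status 200 and an XML body.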
Let's use Google to find the remaining Sitemaps. This method usually only works for big websites such as news websites.
Go to Google.com and type
With any luck, you should find a sitemap.
As you can see, we found a sitemap that isn't on our list. It is a simple and effective way to find sitemaps for huge websites. Depending on the website, this method may or may not turn up a sitemap.
How to read sitemaps
Let's take a look at the source code of a sitemap. Go to https://primates.dev/sitemap.xml and look at the source code.
This is what we call a sitemap index. It is a list of other sitemaps. As mentioned above, it references all the sitemaps of our website.
- loc: URL of the sitemap
- lastmod: last time the Sitemap was modified.
This sitemap index file is essential for crawlers. It is the entry point for Google's crawlers. With this file, they can find out where all the other sitemaps are and whether they have changed since the last crawl.
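To make the "changed since the last crawl" idea concrete, here is a small sketch that reads a sitemap index and keeps only the child sitemaps whose lastmod is newer than a given date. The XML sample is hypothetical (example.com, invented dates), and `changed_since` and `parse_lastmod` are names chosen for this example; it uses the standard library's ElementTree rather than the BeautifulSoup approach shown later.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap index, shaped like the one described above
INDEX_XML = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2020-01-15T08:00:00.000Z</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2020-03-02T11:28:46.280Z</lastmod>
  </sitemap>
</sitemapindex>"""


def parse_lastmod(text):
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def changed_since(index_xml, last_crawl):
    """Return the <loc> of every child sitemap modified after last_crawl."""
    # Parse from bytes so the encoding declaration is honoured
    root = ET.fromstring(index_xml.encode("utf-8"))
    changed = []
    for sitemap in root.findall("sm:sitemap", NS):
        loc = sitemap.findtext("sm:loc", namespaces=NS)
        lastmod = sitemap.findtext("sm:lastmod", namespaces=NS)
        if lastmod and parse_lastmod(lastmod) > last_crawl:
            changed.append(loc)
    return changed


last_crawl = datetime(2020, 2, 1, tzinfo=timezone.utc)
print(changed_since(INDEX_XML, last_crawl))
```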
Let's take a look at https://primates.dev/sitemap-posts.xml. Same thing, look at the source code.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="//primates.dev/sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://primates.dev/brave-says-no-to-error-404/</loc>
    <lastmod>2020-03-02T11:28:46.280Z</lastmod>
    <image:image>
      <image:loc>https://primates.dev/content/images/2020/02/404-brave-not-found-primates-dev.jpg</image:loc>
      <image:caption>404-brave-not-found-primates-dev.jpg</image:caption>
    </image:image>
  </url>
  <url>
    <loc>https://primates.dev/create-games-directly-in-your-browser-babylonjs/</loc>
    <lastmod>2020-02-29T23:49:11.000Z</lastmod>
    <image:image>
      <image:loc>https://primates.dev/content/images/2020/02/gaming-babylonjs-primates-dev.jpg</image:loc>
      <image:caption>gaming-babylonjs-primates-dev.jpg</image:caption>
    </image:image>
  </url>
</urlset>
This is only an extract of the sitemap. Every URL entry in this sitemap carries the same fields.
- urlset: Indicates that it is a list of URLs
- url: Indicates a URL
- loc: URL of the page
- lastmod: Last time the page was changed
- image:image: Indicates that it is an image
- image:loc: URL of the featured image of the article
- image:caption: Caption of the image
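The fields listed above can be read with a few lines of standard-library Python. This is only a sketch: `read_urlset` is a hypothetical helper, the sample is a trimmed copy of the first entry of the extract, and ElementTree is used here instead of the BeautifulSoup approach shown later. Note that the image tags live in their own XML namespace, so they need their own prefix when searching.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1",
}

# Trimmed copy of the first entry of the extract above
URLSET_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://primates.dev/brave-says-no-to-error-404/</loc>
    <lastmod>2020-03-02T11:28:46.280Z</lastmod>
    <image:image>
      <image:loc>https://primates.dev/content/images/2020/02/404-brave-not-found-primates-dev.jpg</image:loc>
      <image:caption>404-brave-not-found-primates-dev.jpg</image:caption>
    </image:image>
  </url>
</urlset>"""


def read_urlset(xml_text):
    """Return one dict per <url>, with page and featured-image fields."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        entries.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            # Missing image tags simply come back as None
            "image_loc": url.findtext("image:image/image:loc", namespaces=NS),
            "image_caption": url.findtext("image:image/image:caption", namespaces=NS),
        })
    return entries


for entry in read_urlset(URLSET_XML):
    print(entry)
```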
Isn't that better than a common crawler? We now have a way to find all the URLs of a website; we just need a way to extract this information. Fortunately for us, here is a small piece of code that does the job :D
Code for parsing sitemaps
Here is the code.
# requests fetches the sitemap; lxml backs BeautifulSoup's "xml" parser
pip install requests beautifulsoup4 lxml pandas
Not many libraries are needed, and the code is fairly simple.
Crawler & Parser function
import hashlib

import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup


# Pass the headers you want to retrieve from the XML, such as ["loc", "lastmod"]
def parse_sitemap(url, headers):
    resp = requests.get(url)

    # We didn't get a valid response, bail
    if resp.status_code != 200:
        return False

    # BeautifulSoup to parse the document (the "xml" parser relies on lxml)
    soup = Soup(resp.content, "xml")

    # Find all the <url> and <sitemap> tags in the document
    urls = soup.find_all("url")
    sitemaps = soup.find_all("sitemap")
    new_list = ["Source"] + headers
    panda_out_total = pd.DataFrame([], columns=new_list)

    if not urls and not sitemaps:
        return False

    # Recursive call to the function if the sitemap contains sitemaps
    if sitemaps:
        for u in sitemaps:
            sitemap_url = u.find("loc").string
            panda_recursive = parse_sitemap(sitemap_url, headers)
            # Skip child sitemaps that could not be fetched or parsed
            if panda_recursive is False:
                continue
            panda_out_total = pd.concat([panda_out_total, panda_recursive], ignore_index=True)

    # Storage for later...
    out = []

    # Creates a hash of the parent sitemap
    hash_sitemap = hashlib.md5(str(url).encode("utf-8")).hexdigest()

    # Extract the keys we want
    for u in urls:
        values = [hash_sitemap]
        for head in headers:
            loc = u.find(head)
            if loc is None:
                loc = "None"
            else:
                loc = loc.string
            values.append(loc)
        out.append(values)

    # Creates a dataframe
    panda_out = pd.DataFrame(out, columns=new_list)

    # If recursive then merge recursive dataframe
    if not panda_out_total.empty:
        panda_out = pd.concat([panda_out, panda_out_total], ignore_index=True)

    # Returns the dataframe
    return panda_out
First of all, we make a request to the URL passed to the function. If the response status is not 200, the function returns False.

resp = requests.get(url)

# We didn't get a valid response, bail
if resp.status_code != 200:
    return False
Then we parse the content of the response using BeautifulSoup4.
# BeautifulSoup to parse the document
soup = Soup(resp.content, "xml")
Then we look for url tags (found in a urlset) and sitemap tags (found in a sitemapindex).

# Find all the <url> and <sitemap> tags in the document
urls = soup.find_all("url")
sitemaps = soup.find_all("sitemap")
If we are in a sitemapindex such as https://primates.dev/sitemap.xml, we recursively call the function, passing the URL (loc) of each child sitemap.

# Recursive call to the function if the sitemap contains sitemaps
if sitemaps:
    for u in sitemaps:
        sitemap_url = u.find("loc").string
        panda_recursive = parse_sitemap(sitemap_url, headers)
        # Skip child sitemaps that could not be fetched or parsed
        if panda_recursive is False:
            continue
        panda_out_total = pd.concat([panda_out_total, panda_recursive], ignore_index=True)
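The recursion is easier to see without the network and DataFrame plumbing. The sketch below is an assumption-laden stand-in, not the article's function: `collect_urls` is a hypothetical name, the `fetch` argument is any callable returning the XML text (or None on failure), and `FAKE_SITE` is made-up data playing the role of HTTP responses.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

# Made-up documents standing in for HTTP responses
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {XMLNS}>"
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        f"<urlset {XMLNS}><url><loc>https://example.com/about/</loc></url></urlset>"
    ),
    "https://example.com/sitemap-posts.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/post-1/</loc></url>"
        "<url><loc>https://example.com/post-2/</loc></url>"
        "</urlset>"
    ),
}


def collect_urls(sitemap_url, fetch, seen=None):
    """Walk a sitemap (or sitemap index) and return every page URL found."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    body = fetch(sitemap_url)
    if body is None:  # the fetcher signals a failed download with None
        return []

    root = ET.fromstring(body)
    # Page URLs listed directly in this document
    pages = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
    # Recurse into every child sitemap listed by an index
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        pages.extend(collect_urls(loc.text, fetch, seen))
    return pages


print(collect_urls("https://example.com/sitemap.xml", FAKE_SITE.get))
```

Passing the fetcher in as a parameter keeps the walk testable offline; in real use it could wrap `requests.get`.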
Then we create an MD5 hash of the sitemap's URL. It goes into the Source column, so every row can be traced back to the sitemap it came from.

# Creates a hash of the parent sitemap
hash_sitemap = hashlib.md5(str(url).encode("utf-8")).hexdigest()
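As a quick illustration of that step, the snippet below wraps the same hashlib call in a helper (the name `sitemap_id` is an assumption for this example) and shows that the result is a fixed-length label that differs from one sitemap URL to the next. MD5 serves here as a compact identifier, not as a security measure.

```python
import hashlib


def sitemap_id(url):
    """Hex MD5 digest of a sitemap URL, used as an identifier only."""
    return hashlib.md5(str(url).encode("utf-8")).hexdigest()


posts_id = sitemap_id("https://primates.dev/sitemap-posts.xml")
tags_id = sitemap_id("https://primates.dev/sitemap-tags.xml")
print(len(posts_id))        # MD5 hex digests are always 32 characters long
print(posts_id == tags_id)  # compares the identifiers of two different sitemaps
```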
Only one step remains to finish the extraction: parsing the information in the sitemap.

# Extract the keys we want
for u in urls:
    values = [hash_sitemap]
    for head in headers:
        loc = u.find(head)
        if loc is None:
            loc = "None"
        else:
            loc = loc.string
        values.append(loc)
    out.append(values)
The function takes a headers parameter: a list of the tags you want to retrieve from the sitemap, such as loc or lastmod. A tag that is missing from an entry is stored as the string "None".
The function returns a pandas DataFrame for easier management down the line.

parse_sitemap("https://primates.dev/sitemap-posts.xml", ["loc", "lastmod"])
    Source                            loc                                                lastmod
0   7e3bc65a80810f933f22b0b2db05d8d6  https://primates.dev/brave-says-no-to-error-404/   2020-03-02T11:28:46.280Z
1   7e3bc65a80810f933f22b0b2db05d8d6  https://primates.dev/create-games-directly-in-...  2020-02-29T23:49:11.000Z
2   7e3bc65a80810f933f22b0b2db05d8d6  https://primates.dev/ddos-with-a-crapy-computer/   2020-02-28T12:43:34.438Z
3   7e3bc65a80810f933f22b0b2db05d8d6  https://primates.dev/parsing-an-api-xml-respon...  2020-02-27T22:11:12.323Z
...
18  7e3bc65a80810f933f22b0b2db05d8d6  https://primates.dev/optimize-your-website-in-...  2020-02-18T14:45:11.000Z
Extract all the URLs of https://primates.dev/sitemap.xml

dataframe = parse_sitemap("https://primates.dev/sitemap.xml", ["loc"])
    Source                            loc
0   abf73adfd112dfa0235f39ac8ef9e6ec  https://primates.dev/the-team/
1   abf73adfd112dfa0235f39ac8ef9e6ec  https://primates.dev/become-an-author/
2   abf73adfd112dfa0235f39ac8ef9e6ec  https://primates.dev/categories/
3   abf73adfd112dfa0235f39ac8ef9e6ec  https://primates.dev/
4   7e3bc65a80810f933f22b0b2db05d8d6  https://primates.dev/brave-says-no-to-error-404/
...
44  bd98aba14cd9e52313c0ae77dc97f892  https://primates.dev/tag/seo/
45  bd98aba14cd9e52313c0ae77dc97f892  https://primates.dev/tag/ads/
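Once the result is in a DataFrame, the usual pandas tools apply. The sketch below is a suggestion, not part of the original script: it builds a two-row stand-in by hand (values taken from the extract shown earlier) instead of calling parse_sitemap, then sorts pages by lastmod and counts URLs per source sitemap.

```python
import pandas as pd

# Hand-made stand-in for the DataFrame returned by parse_sitemap
df = pd.DataFrame(
    [
        ["7e3bc65a80810f933f22b0b2db05d8d6",
         "https://primates.dev/create-games-directly-in-your-browser-babylonjs/",
         "2020-02-29T23:49:11.000Z"],
        ["7e3bc65a80810f933f22b0b2db05d8d6",
         "https://primates.dev/brave-says-no-to-error-404/",
         "2020-03-02T11:28:46.280Z"],
    ],
    columns=["Source", "loc", "lastmod"],
)

# Turn the ISO 8601 strings into timezone-aware timestamps
df["lastmod"] = pd.to_datetime(df["lastmod"], utc=True)

# Most recently modified pages first
newest_first = df.sort_values("lastmod", ascending=False).reset_index(drop=True)
print(newest_first[["loc", "lastmod"]])

# Number of URLs contributed by each sitemap
print(df.groupby("Source")["loc"].count())
```

From there, `newest_first.to_csv("urls.csv", index=False)` would write the list to disk; the file name is just an example.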
I hope this little piece of code will be useful, and that it showed you that standard crawlers are not always the answer. Please feel free to comment if you find new sitemaps. I'll be more than delighted to see what kind of projects you build with it. Have fun!
Link to the Gist of the script here