A Background on Googlebot Crawling

For web pages to show up in Google Search results, the publicly available websites that host them must first be crawled automatically. Googlebot is the program, running on Google's servers, that retrieves a URL and handles network errors, redirects, and the other small complications it encounters as it works its way through the web.

Crawling is the process of discovering new and revisiting updated web pages and downloading them. Googlebot gets a URL, makes an HTTP request to the server hosting it, and then deals with the server's response, possibly following redirects, handling errors, and passing the page content on to Google's indexing system.
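As a rough illustration of that fetch-and-handle step, the sketch below (plain Python standard library, with a placeholder URL and user agent, not Google's actual code) requests a URL, follows redirects, and deals with the most common network errors:

# A minimal sketch of the fetch step: request a URL, follow redirects,
# and handle common network errors. The example URL is a placeholder.
import urllib.error
import urllib.request

def fetch(url: str, timeout: float = 10.0) -> bytes | None:
    """Fetch a URL, returning the response body or None on failure."""
    request = urllib.request.Request(
        url,
        # Identify the client; real crawlers send their own user agent string.
        headers={"User-Agent": "example-crawler/1.0"},
    )
    try:
        # urlopen follows HTTP redirects automatically.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server responded, but with an error status (e.g. 404, 500).
        print(f"HTTP {err.code} for {url}")
    except urllib.error.URLError as err:
        # Network-level problem: DNS failure, refused connection, timeout.
        print(f"Network error for {url}: {err.reason}")
    return None

html = fetch("https://example.com/")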

Beyond HTML, modern websites use a combination of technologies such as JavaScript and CSS to offer users vibrant experiences and useful functionality. When such a page is opened in a browser, the browser first downloads the parent URL, which hosts the data needed to start building the page for the user: the HTML of the page.

This initial data may contain references to resources such as JavaScript, CSS, images and videos, which the browser downloads in turn to construct the final page that is then presented to the user.
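As a rough sketch of that discovery step, the following snippet (Python standard library only, with made-up HTML and resource paths) collects the script, stylesheet, image and video URLs a page references:

# A minimal sketch of how a client discovers the sub-resources an HTML page
# references (scripts, stylesheets, images, videos). The HTML is invented.
from html.parser import HTMLParser

class ResourceCollector(HTMLParser):
    """Collect URLs of sub-resources referenced by a page."""

    def __init__(self) -> None:
        super().__init__()
        self.resources: list[str] = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.resources.append(attrs["src"])    # JavaScript
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.append(attrs["href"])   # CSS
        elif tag in ("img", "video", "source") and attrs.get("src"):
            self.resources.append(attrs["src"])    # images and video

parser = ResourceCollector()
parser.feed("""
<html>
  <head>
    <link rel="stylesheet" href="/styles/main.css">
    <script src="/js/app.js"></script>
  </head>
  <body><img src="/images/hero.jpg"></body>
</html>
""")
print(parser.resources)  # ['/styles/main.css', '/js/app.js', '/images/hero.jpg']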

Google does essentially the same thing, though slightly differently:

  • Googlebot downloads the initial data from the parent URL — the HTML of the page.
  • Googlebot passes on the fetched data to the Web Rendering Service (WRS).
  • Using Googlebot, WRS downloads the resources referenced in the original data.
  • WRS constructs the page using all the downloaded resources as a user’s browser would.

Compared to a browser, the time between each step may be significantly longer due to scheduling constraints such as the perceived load of the server hosting the resources needed for rendering a page.
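The following is a deliberately simplified, hypothetical model of that flow: the fetch and resource-extraction steps are stubbed out so the sketch can focus on the ordering of the steps above and the possibly long gap between the crawl and render phases:

# A simplified, hypothetical model of the crawl -> render pipeline.
# fetch() and extract_resources() are stubs standing in for the real work.
import time

def fetch(url: str) -> str:
    """Stub: pretend to download a URL and return its body."""
    return f"<contents of {url}>"

def extract_resources(html: str) -> list[str]:
    """Stub: pretend to find the sub-resources referenced by the HTML."""
    return ["/js/app.js", "/styles/main.css"]

def crawl_and_render(parent_url: str, scheduling_delay: float = 1.0) -> None:
    # 1. Googlebot downloads the initial HTML of the parent URL.
    html = fetch(parent_url)

    # 2. The fetched data is handed to the rendering service; in practice
    #    this handoff is queued, so the next steps may happen much later.
    time.sleep(scheduling_delay)

    # 3. The rendering service downloads the referenced resources, again
    #    subject to scheduling and the host's perceived load.
    resources = {url: fetch(url) for url in extract_resources(html)}

    # 4. The page is constructed from the HTML plus all downloaded resources.
    print(f"Rendered {parent_url} using {len(resources)} resources")

crawl_and_render("https://example.com/")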

To manage how and which resources are crawled, Google recommends that site owners:

  • Use as few resources as feasible to offer users a great experience; the fewer resources a page needs for rendering, the less crawl budget is spent rendering it.
  • Use cache-busting parameters cautiously: if the URLs of resources change, Google may need to crawl the resources again even if their contents haven't changed, which consumes crawl budget (see the sketch after this list).
  • Host resources on a different hostname from the main site, for example by employing a CDN or just hosting the resources on a different subdomain. This will shift crawl budget concerns to the host that’s serving the resources.
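To make the cache-busting point concrete, here is a small illustration (the version parameter "v" and the URLs are made up): to a crawler, resource URLs that differ only in such a parameter are distinct URLs, each of which may need to be fetched:

# Two URLs that differ only by a cache-busting parameter point at the same
# file, but from a crawler's perspective they are different URLs.
from urllib.parse import urlsplit

old = "https://example.com/styles/main.css?v=41"
new = "https://example.com/styles/main.css?v=42"

print(old == new)  # False: a new URL, so the file may be crawled again
                   # even if its contents are identical.

def without_query(url: str) -> str:
    """Strip the query string to reveal the underlying resource path."""
    return urlsplit(url)._replace(query="").geturl()

print(without_query(old) == without_query(new))  # True: same underlying file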

The best source for analysing which resources Google is crawling is the site's raw access logs, which have an entry for every URL requested by browsers and crawlers alike.
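As a rough starting point, the sketch below (Python, assuming logs in the common "combined" format; the sample line is invented) filters an access log down to requests whose user agent mentions Googlebot. Matching on the user-agent string alone is only a heuristic, since the string can be spoofed; Google also documents how to verify Googlebot by reverse DNS, which this sketch omits:

# A minimal sketch of pulling Googlebot requests out of a raw access log
# in the "combined" format. The sample log line is invented.
import re
from collections import Counter

LINE = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
                  r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
                  r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"')

def googlebot_requests(log_lines):
    """Yield (path, status) for every request whose user agent mentions Googlebot."""
    for line in log_lines:
        match = LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            yield match.group("path"), match.group("status")

sample = [
    '66.249.66.1 - - [10/Dec/2024:06:25:13 +0000] "GET /js/app.js HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
]
# Count how often each URL was requested by Googlebot.
print(Counter(path for path, _ in googlebot_requests(sample)))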

If you want to know more about Googlebot crawling, please get in touch.