Crawling
Crawling is the process by which search engine bots discover and download web pages by following links and sitemaps. It is the step immediately before indexing, where content is analyzed and stored in the search database, so a page that has not been crawled cannot appear in search results.
- Crawling is the action or process by which a search engine's automated bot (for example, Googlebot) discovers and downloads web pages by following links and sitemaps.
- Google Search operates in three stages — crawling, indexing, and serving search results — and crawling is the first of them.
- It proceeds through discovery (finding URLs), fetching (downloading the page), and rendering (executing JavaScript).
- Crawling is controlled with robots.txt, but that file governs crawler access rather than blocking indexing.
- The agent that performs crawling is the crawler (bot), the state of being easy to crawl is crawlability, and the allotted amount of crawling is the crawl budget; each is distinct from crawling itself.
Overview
Crawling is the process in which a search engine uses automated programs to discover web pages across the internet and download their content (text, images, and video). In its official documentation, Google explains that Search works in three stages — crawling, indexing, and serving search results — and defines crawling as the stage where "Google downloads text, images, and videos from pages it found on the internet with automated programs called crawlers." In other words, crawling precedes indexing: for a page to surface in search, it must first be crawled and then indexed.
Here, crawling refers to the action or process itself. It should be distinguished from the crawler (the bot that carries it out), crawlability (how readily a site can be crawled), and crawl budget (the amount of crawling a search engine allocates to a given site).
How It Works
Following Google's documentation, crawling proceeds in three broad steps.
- Discovery (URL discovery): The search engine finds new pages by extracting links from pages it already knows. Google gives the example of "a hub page, such as a category page, linking to a new blog post." Sitemaps submitted by site owners are another path to discovery.
- Fetching: The bot accesses the URL over HTTP and downloads the page. According to Google, for most sites Googlebot should not access a site more than once every few seconds on average. The type of bot making a request can be identified from its HTTP user-agent header.
- Rendering: After fetching a page, Googlebot renders it using a recent version of Chrome, executing JavaScript in the process. Google notes that "websites often rely on JavaScript to show content," and without rendering Google may not see that content.
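The discovery and fetching steps above can be sketched as a small breadth-first loop. This is only an illustration, not how Googlebot is implemented: the `discover` function, the `LinkExtractor` class, and the in-memory `PAGES` site are all hypothetical names, and the network fetch is replaced by a dictionary lookup so the example runs offline. Rendering (JavaScript execution) is not modeled at all.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover(start_url, fetch):
    """Breadth-first URL discovery; `fetch` maps a URL to HTML text or None."""
    frontier = deque([start_url])  # URLs discovered but not yet fetched
    seen = {start_url}             # every URL discovered so far
    order = []                     # URLs in the order they were fetched
    while frontier:
        url = frontier.popleft()
        html = fetch(url)          # the "fetching" step (stubbed in this sketch)
        if html is None:
            continue
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links against the current page and drop #fragments.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return order


# Hypothetical three-page site held in memory instead of fetched over HTTP.
PAGES = {
    "https://example.com/": '<a href="/blog/">Blog</a>',
    "https://example.com/blog/": '<a href="new-post">New post</a> <a href="/">Home</a>',
    "https://example.com/blog/new-post": "<p>No links here.</p>",
}

crawl_order = discover("https://example.com/", PAGES.get)
print(crawl_order)
```

Here the hub page at `/blog/` is what makes `new-post` discoverable, mirroring the category-page example. A real crawler would also need politeness delays, host restrictions, and robots.txt checks, which this sketch leaves out.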
Googlebot comes in two variants, mobile and desktop, and under its mobile-first indexing policy Google performs "the majority of crawls using its mobile crawler."
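As a rough illustration of telling the two variants apart in server logs, the sketch below classifies a User-Agent header value. The function name `classify_googlebot` is hypothetical, and the matching rule is an assumption: it relies on the smartphone crawler's string containing an `Android` or `Mobile` token. The sample strings follow the pattern Google publishes, with `W.X.Y.Z` standing in for the Chrome version. A user-agent header can be spoofed, so this identifies what a request claims to be and does not verify it.

```python
def classify_googlebot(user_agent):
    """Return 'mobile', 'desktop', or None for a User-Agent header value."""
    if "Googlebot" not in user_agent:
        return None
    # Assumption: the smartphone crawler's string carries an Android/Mobile token.
    if "Android" in user_agent or "Mobile" in user_agent:
        return "mobile"
    return "desktop"


# Sample header values; W.X.Y.Z is a placeholder for the Chrome version.
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

print(classify_googlebot(MOBILE_UA))
print(classify_googlebot(DESKTOP_UA))
print(classify_googlebot("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```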
Controlling Crawling
Site owners can control a bot's crawl access with robots.txt. Google defines robots.txt as a file that "tells crawlers which URLs they can access on your site," used primarily to avoid overloading the server or to reduce crawling of unimportant pages.
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
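To see how a crawler might read the rules above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The bot name `ExampleBot` and the two test URLs are hypothetical, and Python's parser is not the one Google uses, so edge cases may be interpreted differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the sample file, parsed from a string instead of fetched.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name; it falls under the "*" group.
allowed_public = parser.can_fetch("ExampleBot", "https://example.com/blog/post")
allowed_private = parser.can_fetch("ExampleBot", "https://example.com/private/report")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(allowed_public, allowed_private, sitemaps)
```

The result only says whether a compliant crawler may fetch each URL. It says nothing about whether the URL can end up in a search index.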
However, robots.txt controls crawling only and is not a means of preventing indexing. Google states that robots.txt "is not a mechanism for keeping a web page out of Google," and warns that a blocked URL can still appear in search results, without a description snippet, if other sites link to it. To keep a page out of search results, use noindex (or password-protect the page). noindex allows crawling but excludes the page from the index, and because a crawler has to fetch a page to see its noindex rule, that page must not also be blocked in robots.txt.
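A minimal sketch of checking a fetched page for noindex follows. It assumes the two common ways of expressing the rule, a `<meta name="robots">` tag in the HTML and an `X-Robots-Tag` HTTP response header. The names `RobotsMetaParser` and `has_noindex` are hypothetical, and the check is simplified: it ignores crawler-specific meta names and user-agent prefixes inside the header value.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def has_noindex(html, headers=None):
    """Return True if the HTML or the response headers carry a noindex rule."""
    header_tokens = set()
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            header_tokens.update(t.strip().lower() for t in value.split(","))
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives or "noindex" in header_tokens


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(page))
print(has_noindex("<p>Plain page</p>"))
print(has_noindex("<p>Plain page</p>", {"X-Robots-Tag": "noindex"}))
```

Both signals are only visible after the page has been fetched, which is why a URL disallowed in robots.txt cannot rely on noindex.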
Crawling vs. Adjacent Concepts
| Term | Meaning | Nature |
|---|---|---|
| Crawling | The process by which a bot discovers and collects pages | Action / process |
| Crawler | The bot that performs crawling (for example, Googlebot) | Agent / program |
| Crawlability | How easily a site can be crawled | State / property |
| Crawl budget | The amount of crawling a search engine allocates to a site | Resource / allowance |
Basis
The definitions and process described here all rest on Google Search Central's official documentation. "In-Depth Guide to How Google Search Works" describes the three stages of crawling, indexing, and serving, along with URL discovery and rendering; "What Is Googlebot" covers the mobile and desktop crawler distinction and crawl frequency; and the "robots.txt Introduction" sets out the principle that robots.txt controls crawling without blocking indexing.