What is robots.txt?

robots.txt is a UTF-8 text file at the site root that tells crawlers which paths they may or may not access (crawl).
Its purpose is managing crawl traffic, not blocking indexing. Google states explicitly that it "is not a mechanism for keeping a web page out of Google."
If another site links to a blocked URL, that page can still appear in search results without a description.
The core directives are User-agent, Disallow, Allow, and Sitemap, and path matching follows the "longest match" rule.
It was standardized as RFC 9309 in September 2022, and because it is not an access-control mechanism, it should not be used for security.

Definition

robots.txt is a text file placed at a website's root (for example, https://www.example.com/robots.txt) that tells search engine crawlers which URLs on the site they may access (crawl). Google Search Central defines robots.txt as "a file that tells crawlers which URLs they can access," and explains that its main purpose is to manage crawl traffic, not to block indexing.

The most common misconception is that robots.txt hides a page from search results. Google makes it clear that "robots.txt is not a mechanism for keeping a web page out of Google." If another site links to a blocked URL, that URL can still surface in search results without a description even though crawling is disallowed. Keeping a page out of the index is therefore the job of noindex or password protection, while robots.txt is a separate tool that governs access at the crawl stage.

How It Works

Before crawling a site, a crawler first reads the robots.txt at the root, finds the group of rules that matches its own name (User-agent), and follows them. Per RFC 9309, the file must live at the top level of the service, be encoded in UTF-8, and be served with the media type text/plain. It cannot be placed in a subdirectory, and each host has only one: the rules apply only to the scheme, host, and port the file is served from.
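As a rough illustration of that lookup, the sketch below derives the robots.txt location from a page URL and then asks Python's standard-library urllib.robotparser which group applies to a given crawler. The crawler name ExampleBot, the helper robots_url, and the inline rules are assumptions made up for this example. The standard-library parser also evaluates rules in file order rather than by longest match, so treat it as an approximation of RFC 9309 behavior.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host (with port); the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical rules, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(robots_url("https://www.example.com/blog/post?id=1"))
# ExampleBot has its own group, so only that group's rules apply to it.
print(parser.can_fetch("ExampleBot", "https://www.example.com/private/a.html"))
print(parser.can_fetch("ExampleBot", "https://www.example.com/tmp/a.html"))
# Any crawler without a named group falls back to the * group.
print(parser.can_fetch("OtherBot", "https://www.example.com/tmp/a.html"))
```

In this sketch ExampleBot is blocked from /private/ but not from /tmp/, because a crawler that finds a group with its own name does not also apply the * group.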

When path rules conflict, the "longest match" principle applies: the most specific rule, meaning the one whose matching path is longest, takes precedence. If an Allow rule and a Disallow rule are equally specific, RFC 9309 recommends applying the Allow rule. robots.txt is also not strictly enforceable: crawlers may interpret it differently, and malicious bots can ignore it.
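The longest-match rule can be sketched in a few lines of Python. This is a simplified illustration, not a full RFC 9309 implementation: it uses plain prefix matching with no wildcards, the function name is_allowed and the sample rules are hypothetical, and the tie-break in favor of Allow follows the RFC's recommendation for equally specific rules.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (directive, path), e.g. ("disallow", "/private/")


def is_allowed(path: str, rules: List[Rule]) -> bool:
    """Apply the longest-match rule using plain prefix matching."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, rule_path in rules:
        if not rule_path or not path.startswith(rule_path):
            continue  # empty or non-matching rules are ignored
        is_allow = directive.lower() == "allow"
        length = len(rule_path.encode("utf-8"))  # specificity in octets
        # A longer rule wins; on a tie, prefer Allow.
        if length > best_len or (length == best_len and is_allow):
            best_len = length
            allowed = is_allow
    return allowed


# Hypothetical rule set: block /shop/ but carve out /shop/public/.
rules = [("disallow", "/shop/"), ("allow", "/shop/public/")]
print(is_allowed("/shop/public/item.html", rules))
print(is_allowed("/shop/cart", rules))
print(is_allowed("/about", rules))
```

Here /shop/public/item.html is allowed because the Allow path is longer than the Disallow path, /shop/cart is blocked, and /about is allowed because no rule matches it.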

Core Directives

Directive	Role	Required
User-agent	Specifies the name of the crawler the rules apply to. * applies to any crawler not named explicitly	Required
Disallow	Specifies a path to block from crawling. It is relative to the root and starts with /	At least one Allow or Disallow per group
Allow	Specifies a path to permit as an exception within a blocked area	At least one Allow or Disallow per group
Sitemap	Announces the location of a sitemap. Must use a full, absolute URL	Optional (multiple allowed)
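To make the "Required" column concrete, here is a minimal sketch of a generator that writes a robots.txt file and rejects input that breaks the constraints in the table. The function build_robots_txt and its data layout are assumptions for illustration, and the checks mirror the table rather than the full RFC 9309 grammar (for instance, an empty Disallow value is valid in the standard but rejected here).

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def build_robots_txt(groups: Dict[str, List[Tuple[str, str]]],
                     sitemaps: List[str]) -> str:
    """Render groups of (directive, path) rules plus Sitemap lines."""
    lines: List[str] = []
    for agent, rules in groups.items():
        if not rules:
            raise ValueError(f"group {agent!r} needs an Allow or Disallow rule")
        lines.append(f"User-agent: {agent}")
        for directive, path in rules:
            if directive not in ("Allow", "Disallow"):
                raise ValueError(f"unsupported directive: {directive}")
            if not path.startswith("/"):
                raise ValueError(f"path must start with '/': {path}")
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line between groups
    for url in sitemaps:
        parts = urlsplit(url)
        if not (parts.scheme and parts.netloc):
            raise ValueError(f"sitemap must be an absolute URL: {url}")
        lines.append(f"Sitemap: {url}")
    return "\n".join(lines) + "\n"


text = build_robots_txt(
    {"Googlebot": [("Disallow", "/nogooglebot/")], "*": [("Allow", "/")]},
    ["https://www.example.com/sitemap.xml"],
)
print(text)
```

Validating at generation time is one possible design choice; it catches a group with no rules or a relative Sitemap URL before the file is deployed.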

Code Example

# Rules that apply only to Googlebot
User-agent: Googlebot
Disallow: /nogooglebot/

# Rules that apply to all other crawlers
User-agent: *
Allow: /

# Sitemap location (full, absolute URL)
Sitemap: https://www.example.com/sitemap.xml

Paths start with /, a rule that targets a whole directory ends with /, and path matching is case-sensitive. The format supports * (wildcard) and $ (end-of-pattern marker), and lines starting with # are treated as comments. Any path you do not explicitly block is allowed by default.
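One way to picture how * and $ behave is to translate a path pattern into a regular expression. The sketch below does that in Python; the helper names pattern_to_regex and matches and the sample paths are hypothetical, and real crawlers may handle edge cases such as percent-encoding differently.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then turn each * into "any characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the path; matching is case-sensitive.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private/", "/private/page.html"))        # plain prefix match
print(matches("/Private/", "/private/page.html"))        # case differs
print(matches("/*.pdf$", "/docs/file.pdf"))              # ends with .pdf
print(matches("/*.pdf$", "/docs/file.pdf?download=1"))   # $ requires the end
print(matches("/*.pdf", "/docs/file.pdf?download=1"))    # no $, prefix is enough
```

Without the trailing $, a pattern behaves as a prefix, which is why /*.pdf still matches a URL with a query string after .pdf while /*.pdf$ does not.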

Common Misconceptions

Mistaken for blocking indexing: robots.txt only prevents crawling; it does not guarantee removal from the index. Google notes that a blocked URL can still appear in search results if external links point to it. Use noindex to exclude a page from the index.
Mistaken for a security measure: RFC 9309 states that "these rules are not a form of access authorization." Sensitive information should be handled with server-side password protection.
Using noindex and robots.txt together: if crawling is blocked, the crawler cannot read the page's noindex tag at all. To drop a page from the index with noindex, you must therefore leave that page crawlable.
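The noindex conflict in the last item can be checked mechanically. The following sketch, which assumes you already have a list of URLs that carry a noindex tag, reports the ones a crawler would never get to read. The function name blocked_noindex_urls and the sample URLs are hypothetical, and the standard-library parser used here does not support wildcards or longest-match precedence, so results for complex rule sets may differ from a real crawler's.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def blocked_noindex_urls(robots_txt: str, noindex_urls: Iterable[str],
                         user_agent: str = "*") -> List[str]:
    """Return URLs tagged noindex that the crawler is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(user_agent, url)]


# Hypothetical rules and pages for illustration.
robots = "User-agent: *\nDisallow: /drafts/\n"
pages = [
    "https://www.example.com/drafts/old.html",   # noindex tag can never be read
    "https://www.example.com/thanks.html",       # crawlable, so noindex works
]
conflicts = blocked_noindex_urls(robots, pages)
print(conflicts)
```

Any URL in the returned list needs either its Disallow rule removed or a different approach to keeping it out of the index.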
