
Robots.txt

Robots.txt is a simple text file at the root of a website that tells crawlers and bots which pages they may or may not access. It is a voluntary set of rules, not a technical lock.

Last updated June 8, 2026

Definition

Robots.txt is a plain-text file placed at a site's root (for example example.com/robots.txt) that follows the Robots Exclusion Protocol. It gives instructions to web crawlers about which URLs they are allowed or disallowed to visit.
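Because the file sits at a fixed path, its address can be derived from any page URL on the same host. The following is a minimal Python sketch using only the standard library; the helper name robots_url is an assumption made for illustration, not part of any standard API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url.

    Hypothetical helper: keeps the scheme and host (including any port),
    then replaces the path, query, and fragment with /robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://example.com/blog/post?id=1"))
```

Each host has its own file, so a page on a subdomain would map to that subdomain's robots.txt rather than the parent domain's.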

How robots.txt works

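Before crawling, well-behaved bots fetch this file and read directives such as User-agent, Disallow, Allow, and Crawl-delay. For example, Disallow: /admin/ asks crawlers to avoid every URL whose path starts with /admin/. Crawl-delay is a non-standard extension that some crawlers, including Googlebot, ignore. The file may also list one or more Sitemap URLs.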
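To make the directives concrete, here is a small sketch that parses a sample file with Python's standard urllib.robotparser module. The file contents, the ExampleBot user-agent name, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not taken from a real site).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

rules = RobotFileParser()
# parse() accepts an iterable of lines, so no network request is made here.
rules.parse(SAMPLE_ROBOTS.splitlines())

# can_fetch() answers: may this user agent request this URL?
admin_allowed = rules.can_fetch("ExampleBot", "https://example.com/admin/users")
blog_allowed = rules.can_fetch("ExampleBot", "https://example.com/blog/robots-guide")

print("admin page allowed:", admin_allowed)  # False: path starts with /admin/
print("blog page allowed:", blog_allowed)    # True: no rule matches
```

Python's parser applies the first matching rule in file order, which can differ from crawlers that prefer the longest matching path, so the sample deliberately avoids overlapping Allow and Disallow rules. For a live site, RobotFileParser.set_url() followed by read() fetches the file over the network instead.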

Why it matters for web scraping

  • It is advisory, not enforced: it does not technically block access, but ignoring it can be considered abusive and may trigger blocking or legal scrutiny.
  • Ethical scrapers honor robots.txt and any Crawl-delay to avoid overloading servers.
  • It reveals site structure, including paths the site owner would prefer crawlers and search engines to stay away from.

Treating robots.txt as a baseline for responsible crawling helps keep scraping projects both ethical and sustainable.
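One way to put these points into practice is to check every URL against the rules and pause between requests. The sketch below assumes a one-second fallback when no Crawl-delay is given; the names polite_delay, crawl, and DEFAULT_DELAY are hypothetical, and the fetch function is supplied by the caller.

```python
import time
from urllib.robotparser import RobotFileParser

DEFAULT_DELAY = 1.0  # assumed fallback in seconds when no Crawl-delay is set


def polite_delay(rules: RobotFileParser, user_agent: str,
                 default: float = DEFAULT_DELAY) -> float:
    """Return the site's Crawl-delay for user_agent, or the fallback."""
    delay = rules.crawl_delay(user_agent)  # None when the directive is absent
    return float(delay) if delay is not None else default


def crawl(urls, rules, user_agent, fetch, sleep=time.sleep):
    """Fetch only the URLs robots.txt permits, pausing after each request.

    fetch is any callable that takes a URL; sleep is injectable so the
    waiting behavior can be replaced, for example in tests.
    """
    delay = polite_delay(rules, user_agent)
    fetched = []
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip paths the site asks crawlers to avoid
        fetch(url)
        fetched.append(url)
        sleep(delay)
    return fetched
```

Passing the fetch and sleep functions in as arguments keeps the sketch independent of any particular HTTP library.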

Examples

1. A site using Disallow: /private/ to ask crawlers to stay out of a private folder
2. Googlebot reading robots.txt before crawling a website
3. A robots.txt file pointing crawlers to the XML sitemap

Common Use Cases

Checking which areas a site asks crawlers to avoid
Setting a respectful crawl-delay for ethical scraping
Discovering a site's sitemap and structure
Building compliant, well-behaved web crawlers
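Sitemap discovery, for instance, can be sketched with the standard library alone. The helper name list_sitemaps is an assumption for illustration, and RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def list_sitemaps(robots_lines):
    """Return the Sitemap URLs declared in robots.txt content.

    robots_lines is an iterable of lines from an already downloaded file.
    site_maps() returns None when no Sitemap line exists, so this
    hypothetical helper normalizes that case to an empty list.
    """
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return rules.site_maps() or []


if __name__ == "__main__":
    sample = [
        "User-agent: *",
        "Disallow: /tmp/",
        "Sitemap: https://example.com/sitemap.xml",
    ]
    print(list_sitemaps(sample))
```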

Frequently Asked Questions

Robots.txt is not enforceable on its own. It is a voluntary standard, not a technical or legal barrier, but ignoring it can be viewed as abusive and may lead to blocks or disputes.
Nothing technically prevents a scraper from ignoring robots.txt, but doing so risks IP bans, server strain, and reputational or legal consequences, so responsible scrapers honor it.
The file always sits at the root of a domain, such as https://example.com/robots.txt. Crawlers fetch it from there before visiting other pages.