Robots.txt

Robots.txt is a simple text file at the root of a website that tells crawlers and bots which pages they may or may not access. It is a voluntary set of rules, not a technical lock.

Last updated June 8, 2026

Definition

Robots.txt is a plain-text file placed at a site's root (for example, example.com/robots.txt) that follows the Robots Exclusion Protocol. It tells web crawlers which URLs they may and may not visit.

How robots.txt works

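Before crawling, well-behaved bots fetch this file and read directives such as User-agent, Disallow, Allow, and Crawl-delay. Rules are grouped under a User-agent line naming the crawler they apply to, with * matching any crawler. For example, Disallow: /admin/ asks crawlers to avoid the admin folder. Crawl-delay is a nonstandard extension that some crawlers honor and others, including Googlebot, ignore. The file may also list a Sitemap URL.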
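As a minimal sketch of how a crawler might read these directives, the snippet below feeds a made-up robots.txt to Python's standard urllib.robotparser module. The file contents, the ExampleBot name, and the example.com URLs are assumptions for illustration, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

USER_AGENT = "ExampleBot"  # hypothetical crawler name

# Ask the parser the same questions a well-behaved bot would ask.
admin_allowed = parser.can_fetch(USER_AGENT, "https://example.com/admin/users")
blog_allowed = parser.can_fetch(USER_AGENT, "https://example.com/blog/post")
delay = parser.crawl_delay(USER_AGENT)   # seconds, or None if not specified
sitemaps = parser.site_maps()            # list of URLs, or None if not listed

print(admin_allowed, blog_allowed, delay, sitemaps)
```

Against a live site, calling set_url() with the robots.txt address and then read() would fetch the file over the network instead of using parse().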

Why it matters for web scraping

It is advisory, not enforced: it does not technically block access, but ignoring it can be considered abusive and may trigger blocking or legal scrutiny.
Ethical scrapers honor robots.txt and any Crawl-delay to avoid overloading servers.
It can reveal site structure, including paths a site owner prefers to keep out of search engines.

Treating robots.txt as a baseline for responsible crawling helps keep scraping projects both ethical and sustainable.
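One way to apply these points is sketched below: skip every URL the rules disallow and pause between the requests that remain. The polite_crawl function, its parameters, the one-second fallback delay, and the ExampleBot name are all hypothetical choices made for this example, and the demo uses a stand-in fetcher rather than real HTTP requests.

```python
import time
from urllib.robotparser import RobotFileParser


def polite_crawl(urls, parser, user_agent, fetch, default_delay=1.0, sleep=time.sleep):
    """Fetch only the URLs robots.txt allows, pausing between requests.

    `fetch` is any callable that takes a URL and returns its content;
    `sleep` is injectable so the pause can be replaced in tests.
    """
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default_delay  # assumed fallback when no Crawl-delay is given

    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # the site asked crawlers to stay out of this path
        if results:
            sleep(delay)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Demo with made-up rules and a stand-in fetcher, so nothing touches the network.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/", "Crawl-delay: 2"])

pauses = []  # records each requested pause instead of actually sleeping
pages = polite_crawl(
    ["https://example.com/a", "https://example.com/private/b", "https://example.com/c"],
    rules,
    "ExampleBot",
    fetch=lambda url: f"<html>{url}</html>",
    sleep=pauses.append,
)
print(sorted(pages), pauses)
```

Passing fetch and sleep as arguments keeps the policy (what to skip, how long to wait) separate from the HTTP client, so the same loop could wrap whichever client a project already uses.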

Examples

A site using Disallow: /private/ to ask crawlers to stay out of a folder

Googlebot reading robots.txt before crawling a website

A robots.txt file pointing crawlers to the XML sitemap
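To illustrate how rules can differ per crawler, the sketch below parses a made-up file with one group for Googlebot and a stricter fallback group for everyone else. The rules, the OtherBot name, and the URLs are assumptions. Python's urllib.robotparser is used here only as a convenient stand-in; real crawlers may match user agents and rule precedence somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot, a stricter fallback for all others.
SAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# Googlebot is governed by its own group, so only /private/ is off limits.
google_private = parser.can_fetch("Googlebot", "https://example.com/private/report.html")
google_public = parser.can_fetch("Googlebot", "https://example.com/news/")

# A crawler with no group of its own falls back to the * rules.
other_public = parser.can_fetch("OtherBot", "https://example.com/news/")

print(google_private, google_public, other_public)
```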

Common Use Cases

Checking which areas a site asks crawlers to avoid

Setting a respectful crawl-delay for ethical scraping

Discovering a site's sitemap and structure

Building compliant, well-behaved web crawlers

Frequently Asked Questions

Robots.txt is not binding or enforced. It is a voluntary standard, not a technical or legal barrier, but ignoring it can be viewed as abusive and may lead to blocks or disputes.

If a scraper ignores robots.txt, nothing technically prevents access, but it risks IP bans, server strain, and reputational or legal consequences, so responsible scrapers honor it.

The file always lives at the root of a domain, such as https://example.com/robots.txt. Crawlers fetch it from there before visiting other pages.
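Because the file sits at a fixed path, its address can be derived from any page URL on the same host. The helper below is a small sketch of that step; the robots_url name and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt address for the host that serves page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"expected an absolute URL, got {page_url!r}")
    # Keep the scheme and host (including any port); drop path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


blog_robots = robots_url("https://example.com/blog/post?id=7#top")
shop_robots = robots_url("https://shop.example.com:8443/cart")
print(blog_robots)
print(shop_robots)
```

The subdomain and port are kept because each host serves its own robots.txt, so rules found on example.com say nothing about shop.example.com.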

Keep Learning

Web Scraping

Web scraping is the automated extraction of data from websites — fetching pages programmatically and parsing their content into structured data.

IP Rotation

IP rotation is the practice of automatically cycling through multiple IP addresses so that successive requests originate from different IPs.

User Agent

A user agent is the identifying string a browser sends with every request, telling the server which browser, version and operating system you are using.

Rate Limiting

Rate limiting restricts how many requests a client can make in a given time, and it is one of the most common defenses scrapers must work around.
