Robots.txt
Robots.txt is a simple text file at the root of a website that tells crawlers and bots which pages they may or may not access. It is a voluntary set of rules, not a technical lock.
Definition
Robots.txt is a plain-text file placed at a site's root (for example example.com/robots.txt) that follows the Robots Exclusion Protocol. It gives instructions to web crawlers about which URLs they are allowed or disallowed to visit.
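Because the file sits at the root of the host, its location can be derived from any page URL on that host. The following is a minimal Python sketch using only the standard library; the helper name `robots_url` is hypothetical and chosen here for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=1"))
# prints: https://example.com/robots.txt
```

Each host has its own file, so a subdomain such as shop.example.com is not covered by example.com/robots.txt.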
How robots.txt works
Before crawling, well-behaved bots fetch this file and read directives such as User-agent, Disallow, Allow, and Crawl-delay (a non-standard extension that some crawlers, including Googlebot, ignore). For example, Disallow: /admin/ asks crawlers to avoid the admin folder. The file may also list a Sitemap URL.
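The directives above can be read with Python's standard `urllib.robotparser` module. This is a sketch rather than a definitive implementation: the rules, the example.com URLs, and the `ExampleBot` user agent are made up for illustration, and the file is parsed from a string so that no network request is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/help/
Disallow: /admin/
Crawl-delay: 5

Sitemap: https://example.com/sitemap.xml
"""

AGENT = "ExampleBot"  # assumed crawler name

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in memory instead of fetching

print(parser.can_fetch(AGENT, "https://example.com/admin/users"))     # False
print(parser.can_fetch(AGENT, "https://example.com/admin/help/faq"))  # True
print(parser.can_fetch(AGENT, "https://example.com/blog/"))           # True
print(parser.crawl_delay(AGENT))  # 5
print(parser.site_maps())         # ['https://example.com/sitemap.xml'] (Python 3.8+)
```

Python's parser applies the first matching rule in file order, which is why Allow is listed before Disallow here. Other crawlers may instead prefer the most specific (longest) matching path, so results can differ between tools for overlapping rules.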
Why it matters for web scraping
- It is advisory, not enforced: it does not technically block access, but ignoring it can be considered abusive and may trigger blocking or legal scrutiny.
- Ethical scrapers honor robots.txt and any Crawl-delay to avoid overloading servers.
- It reveals site structure and hidden paths a site owner prefers to keep out of search engines.
Treating robots.txt as a baseline for responsible crawling helps keep scraping projects both ethical and sustainable.
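One way to apply these points in a scraper is to check every URL against the rules and pause between requests. The sketch below assumes Python's standard `urllib.robotparser`; the function name `crawl_politely`, the `fetch` callback, the one-second fallback delay, and the example rules are all hypothetical. The fetch and sleep functions are passed in so the example runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser


def crawl_politely(urls, parser, user_agent, fetch, sleep=time.sleep, default_delay=1.0):
    """Fetch only the URLs robots.txt allows, pausing between requests."""
    # Use the site's Crawl-delay if it declares one, else an assumed fallback.
    delay = parser.crawl_delay(user_agent) or default_delay
    results = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed: skip without requesting the page
        if results:
            sleep(delay)  # wait before every request after the first
        results[url] = fetch(url)
    return results


# Usage with made-up rules and a stub fetch function.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/", "Crawl-delay: 2"])

pauses = []
pages = crawl_politely(
    ["https://example.com/", "https://example.com/private/x", "https://example.com/about"],
    rules,
    "ExampleBot",
    fetch=lambda url: f"<html for {url}>",
    sleep=pauses.append,  # record the pauses instead of actually sleeping
)
print(list(pages))  # ['https://example.com/', 'https://example.com/about']
print(pauses)       # [2]
```

In a real crawler, `fetch` would perform the HTTP request and `sleep` would be left at its default of `time.sleep`.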
Examples
- A site using Disallow: /private/ to ask crawlers to stay out of a folder
- Googlebot reading robots.txt before crawling a website
- A robots.txt file pointing crawlers to the XML sitemap
Keep Learning
Web Scraping
Web scraping is the automated extraction of data from websites: fetching pages programmatically and parsing their content into structured data.
IP Rotation
IP rotation is the practice of automatically cycling through multiple IP addresses so that successive requests originate from different IPs.
User Agent
A user agent is the identifying string a browser sends with every request, telling the server which browser, version and operating system you are using.
Rate Limiting
Rate limiting restricts how many requests a client can make in a given time, and it is one of the most common defenses scrapers must work around.