Web Scraping with Python in 2026: Step-by-Step Tutorial
Practical step-by-step Python web scraping tutorial for 2026 — requests, BeautifulSoup, Playwright, Scrapy, proxies, retries, and production tips.
Python powers an estimated 70% of all web scraping pipelines in production in 2026, from indie developers pulling competitor prices to enterprise teams building LLM training corpora. The reason is simple: a working scraper in Python is roughly 10 lines of code, the ecosystem is mature, and every modern scraping concept (proxy rotation, JS rendering, retries, async) has first-class support in widely used third-party libraries.
The trouble is that the path from "10 lines of code" to "production scraper that does not break weekly" runs through a dozen subtle decisions — which library, how to parse, where to retry, how to handle pagination, when to switch to a real browser, where to add proxies. Get any one of them wrong and the pipeline silently returns empty data for a week before anyone notices.
This guide is the practical step-by-step Python web scraping tutorial for 2026 — real code samples for requests, BeautifulSoup, Playwright, and Scrapy, plus how to add proxies, retries, and rate limiting before shipping. For broader context, see our companion guides on what proxies are and scaling web scraping in 2026.
What You Will Need
Three things to get started: Python 3.9+, a code editor (VS Code or PyCharm both work), and either a virtual environment or a fresh project directory. The libraries used in this tutorial install with a single command, with no Docker and no API keys for the basic examples. On a minimal Linux image, Playwright's Chromium may also need a few OS packages, which playwright install --with-deps chromium can pull in.
pip install requests beautifulsoup4 httpx playwright scrapy
playwright install chromium  # downloads the Chromium build that Playwright drives
For the proxy examples later in the post, you will need credentials from any residential proxy provider — most ship a free trial or free tier that covers the tutorial volume. We use environment variables (PROXY_USER, PROXY_PASS) so credentials never land in your script directly.
The Python Web Scraping Stack in 2026
Five libraries cover roughly 95% of Python scraping workloads. Match the tool to your use case before writing the first line of code — picking the wrong library at the start adds rewrite cost later.
| Library | Best For | Strength | Weakness |
|---|---|---|---|
| requests | Simple HTTP, prototypes | Universal support, easy | Synchronous only |
| httpx | Async + sync HTTP | HTTP/2, async, requests-compatible API | Slightly heavier dependency |
| BeautifulSoup | HTML parsing | Clean CSS-like selectors | Just a parser, no fetching |
| Playwright | JS-heavy sites, SPAs | Real browser, full JS execution | Slower, higher resource cost |
| Scrapy | Production-scale pipelines | Built-in async, retries, pipelines | Steeper learning curve |
Step-by-Step Tutorial — Build a Real Scraper
The four steps below take a brand-new project from "empty file" to a working scraper with pagination, retries, and rate limiting. Each step builds directly on the previous one — copy, run, then layer on the next piece.
Step 1 — Make Your First Request
Use a real public scraping practice site (books.toscrape.com) so you can run this code without modification. The pattern is the same for any target.
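import requests
from bs4 import BeautifulSoup
url = "https://books.toscrape.com/catalogue/page-1.html"
headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper/1.0)"}
resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
# Pass raw bytes so BeautifulSoup can detect the page encoding itself
soup = BeautifulSoup(resp.content, "html.parser")
# Each book on the page sits in an <article class="product_pod"> element
for book in soup.select("article.product_pod"):
    title = book.h3.a["title"]
    price = book.select_one(".price_color").get_text(strip=True)
    print(f"{title}: {price}")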
Step 2 — Parse Structured Data With BeautifulSoup
The same selector pattern works for any HTML structure. Use a list comprehension to build dictionaries you can pipe straight into pandas, a database, or JSON.
books = [
    {
        "title": book.h3.a["title"],
        "price": book.select_one(".price_color").get_text(strip=True),
        "rating": book.select_one("p.star-rating")["class"][1],
        "url": book.h3.a["href"],
    }
    for book in soup.select("article.product_pod")
]
print(f"Scraped {len(books)} books")
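The dictionaries built in this step still hold raw strings: the price carries a currency symbol, the rating is a word such as "Three", and the url is relative to the page it came from. Below is a minimal sketch of a normalization pass, assuming the field formats seen on books.toscrape.com. The names parse_price, normalize_book, and RATING_WORDS, along with the sample row, are illustrative and not part of the original tutorial.

```python
import re
from urllib.parse import urljoin

# Assumption: ratings appear as the English words One..Five in the CSS class.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_price(text):
    """Extract the numeric part of a price string such as '£51.77'."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no number found in price: {text!r}")
    return float(match.group())


def normalize_book(raw, page_url):
    """Return a copy of a scraped row with a typed price, rating, and absolute URL."""
    return {
        "title": raw["title"].strip(),
        "price": parse_price(raw["price"]),
        "rating": RATING_WORDS.get(raw["rating"]),  # None if the word is unexpected
        "url": urljoin(page_url, raw["url"]),
    }


# Illustrative sample row shaped like the dictionaries built in Step 2.
sample = {
    "title": "A Light in the Attic",
    "price": "£51.77",
    "rating": "Three",
    "url": "a-light-in-the-attic_1000/index.html",
}
page_url = "https://books.toscrape.com/catalogue/page-1.html"
print(normalize_book(sample, page_url))
```

Matching only the digits keeps the parser tolerant of stray characters in front of the number, which can appear when a page's encoding is misdetected.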
Step 3 — Handle Pagination
Most real targets paginate. Loop with a polite delay between pages and break gracefully when the server returns a non-200 status. time.sleep(1) is the simplest rate limiter — sophisticated alternatives come in Step 4.
import time
all_books = []
for page in range(1, 51):
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code != 200:
        print(f"Stopping at page {page} (status {resp.status_code})")
        break
    soup = BeautifulSoup(resp.content, "html.parser")
    # extend() accepts the result list directly; no comprehension needed
    all_books.extend(soup.select("article.product_pod"))
    time.sleep(1)  # polite pause before the next page
print(f"Total books across all pages: {len(all_books)}")
Step 4 — Add Retries and Rate Limiting
Production scrapers need automatic retries on transient failures (429, 5xx) and exponential backoff to avoid amplifying load during outages. A requests.Session combined with an HTTPAdapter and a urllib3 Retry policy covers this in about ten lines.
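from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
retry = Retry(
    total=3,  # give up after three retries
    backoff_factor=2,  # waits grow exponentially between attempts
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET"],  # only retry idempotent GETs
)
session = requests.Session()
adapter = HTTPAdapter(max_retries=retry)
session.mount("https://", adapter)
session.mount("http://", adapter)  # cover plain-HTTP targets too
session.headers.update(headers)
resp = session.get("https://books.toscrape.com/", timeout=15)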
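The retry policy above reacts to failures but does not pace successful requests. One way to add rate limiting is a small throttle that enforces a minimum gap between calls. This is a sketch rather than a library feature: the Throttle class and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request; None before the first

    def wait(self):
        """Sleep just long enough to keep min_interval between calls; return the pause."""
        now = self._clock()
        pause = 0.0
        if self._last is not None:
            pause = max(0.0, self.min_interval - (now - self._last))
        if pause > 0:
            self._sleep(pause)
        self._last = now + pause
        return pause


throttle = Throttle(min_interval=0.2)
for page in range(1, 4):
    paused = throttle.wait()
    # A real scraper would call session.get(...) here.
    print(f"page {page}: waited {paused:.2f}s")
```

Unlike a fixed time.sleep(1), the throttle subtracts the time already spent on the previous request, so slow responses do not add a full extra second on top.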
Scaling Up — When to Switch Libraries
The four-step pattern above handles most low-friction scraping. Three scenarios push you into more capable tools.
JavaScript-rendered sites (SPAs, React apps, lazy-loaded grids) return empty HTML to requests because the data loads via JS after the initial response. Switch to Playwright for these targets — it spins up a real Chromium instance, executes JS, and returns the fully hydrated DOM.
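As a rough illustration of that switch, the sketch below uses Playwright's synchronous API to load a page in headless Chromium and return the rendered HTML. The function name render_html and the 30-second timeout are assumptions made for this example, and the "networkidle" wait is one reasonable choice, not the only one.

```python
def render_html(url, timeout_ms=30_000):
    """Load url in headless Chromium and return the DOM after scripts have run."""
    # Imported lazily so this module still imports where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity has quieted down,
            # which usually means client-side data has finished loading.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()  # always release the browser process
```

The string returned by render_html(...) can be handed to BeautifulSoup exactly like resp.text from requests, so the parsing code from Steps 1 and 2 carries over unchanged.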
High-volume async scraping (10,000+ URLs in a single run) benefits from async I/O. Switch to httpx with asyncio for 5–10× throughput on the same hardware, since most scraping time is spent waiting on network responses. A bare-bones async fetcher is under 10 lines:
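import asyncio
import httpx
async def fetch(client, url):
    resp = await client.get(url, timeout=15)
    return resp.text
async def main(urls):
    # One shared client reuses connections across all requests
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(*[fetch(client, u) for u in urls])
# Example input; replace with your own list of URLs
url_list = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 6)]
results = asyncio.run(main(url_list))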
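A plain asyncio.gather starts every request at once, which for thousands of URLs can overwhelm both the target and your own connection pool. A common remedy is to cap concurrency with a semaphore. The sketch below shows the pattern with a stand-in fetch function so it runs offline; gather_bounded, fake_fetch, the example.com URLs, and the limit of 20 are illustrative choices.

```python
import asyncio


async def gather_bounded(fetch, urls, limit=20):
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:
            try:
                return await fetch(url)
            except Exception as exc:
                # Return the error so one bad URL does not cancel the batch.
                return exc

    return await asyncio.gather(*(guarded(u) for u in urls))


async def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs without a network."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


demo_urls = [f"https://example.com/page-{n}" for n in range(10)]
pages = asyncio.run(gather_bounded(fake_fetch, demo_urls, limit=3))
print(len(pages), "pages fetched")
```

With httpx, the fetch argument could be functools.partial(fetch, client) built from the fetch(client, url) coroutine shown earlier. Results come back in input order, with exceptions in place of failed pages.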
Production-scale pipelines (millions of pages, multiple workers, persistent queues) graduate to Scrapy. It ships with async I/O, automatic retries, pipelines for data validation and storage, middlewares for proxy rotation and user-agent rotation, and a CLI for scheduled runs — all the production scaffolding you would otherwise build by hand.
Adding a Proxy to Your Python Scraper
For anything past prototyping against polite targets, you need a proxy. A single static IP sending hundreds of requests per minute gets flagged within hours on most modern sites. With requests, the integration is a single proxies argument.
import os
# Credentials come from the environment; gate.provider.com:7000 is a placeholder endpoint
proxy_url = f"http://{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}@gate.provider.com:7000"
proxies = {"http": proxy_url, "https": proxy_url}
resp = session.get(url, proxies=proxies, timeout=15)
For Playwright, pass the proxy in the launch options instead of per request. For Scrapy, configure a single proxy middleware that injects the URL into every request automatically. The same proxy URL works across all three setups — picking a provider that supports the protocol your tool needs is the only configuration choice that matters here.
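The two non-requests setups can be sketched as follows. Playwright's launch() accepts a proxy dictionary with server, username, and password keys, and Scrapy's built-in HttpProxyMiddleware reads the proxy from request.meta["proxy"]. The helper playwright_proxy, the class EnvProxyMiddleware, and the PROXY_URL setting name are hypothetical names chosen for this example.

```python
def playwright_proxy(host, port, user, password):
    """Build the dict accepted by Playwright's launch(proxy=...) option."""
    return {"server": f"http://{host}:{port}", "username": user, "password": password}


class EnvProxyMiddleware:
    """Scrapy downloader middleware that sets the same proxy on every request."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URL is a hypothetical setting name, not a built-in Scrapy setting.
        return cls(crawler.settings.get("PROXY_URL"))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware picks up request.meta["proxy"].
        # setdefault() leaves any per-request override untouched.
        request.meta.setdefault("proxy", self.proxy_url)
        return None  # let the request continue through the middleware chain
```

For Playwright, the dictionary goes into p.chromium.launch(proxy=playwright_proxy(...)). For Scrapy, the middleware would be registered in DOWNLOADER_MIDDLEWARES with a priority below 750 so it runs before the built-in HttpProxyMiddleware, and in both cases the credentials would be read from os.environ as in the requests example.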
Recommended Proxies for Python Web Scraping
The four providers below ship clean Python integration (REST APIs, official SDKs, copy-paste documentation) and cover the full price/feature spectrum for Python scrapers in 2026.
1. BrightData
BrightData runs the deepest scraping stack with 72M+ residential IPs across 195 countries, official Python SDK, and a Web Unlocker API that handles JA3 spoofing and CAPTCHA bypass server-side. Drop the proxy URL into requests or wrap the Unlocker endpoint in httpx — the response comes back ready for BeautifulSoup. The default enterprise choice.
2. Oxylabs
Oxylabs ships the cleanest Python documentation on the market alongside 102M+ IPs at 99.99% uptime. The dedicated oxylabs Python SDK wraps proxy, SERP, and e-commerce scraping behind typed clients with built-in retries. Compliance audits and SOC 2 certification make it the safe pick for finance, travel, and brand-protection pipelines.
3. Decodo
Decodo is the developer-friendly value pick — 115M+ IPs at 99.99% uptime with plans from $30/month. The single-URL auth format drops into requests, httpx, Playwright, and Scrapy identically with one line of config, and sticky session control via the username parameter handles multi-step authenticated flows without extra SDK calls.
4. Geonode
Geonode is the unlimited-bandwidth champion for high-volume Python scrapers. With 30M+ residential IPs and thread-based pricing (instead of per-GB metering), multi-terabyte training-data runs become predictable rather than budget-busting. Past 500GB monthly traffic, Geonode beats per-GB providers on TCO without sacrificing IP quality.
Common Mistakes Python Scrapers Make
Ignoring User-Agent Headers
The default python-requests/2.x User-Agent is a textbook bot fingerprint that most modern targets flag instantly. Always set a realistic browser User-Agent in your session headers, and consider rotating across a small pool of 5–10 real Chrome and Firefox versions for sustained scraping. The fix is one line of code and dramatically reduces block rates against any target with basic bot detection in place.
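A minimal sketch of the rotation idea, assuming a small hand-maintained pool. The User-Agent strings below are illustrative examples of the Chrome and Firefox formats and will go stale as browsers update; pick_headers and the Accept-Language value are likewise assumptions for this example.

```python
import random

# Illustrative strings only: refresh the pool as browser versions move on.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


def pick_headers(rng=random):
    """Return request headers with a User-Agent drawn at random from the pool."""
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
    }


print(pick_headers()["User-Agent"])
```

Passing headers=pick_headers() to each session.get(...) call rotates the value per request, while session.headers.update(pick_headers()) fixes one value for the whole session.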
Skipping Rate Limiting
Running an unthrottled loop against a single target site is the fastest way to get IP-banned, regardless of proxy quality. Use time.sleep() between requests (typically 0.5–2 seconds), randomize the delay slightly to avoid mechanical timing fingerprints, and respect any Retry-After headers returned by 429 responses. Politeness costs nothing and prevents the most common cause of weekend-long debugging sessions.
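Both pieces of that advice fit in a few lines of standard-library Python. The sketch below draws a randomized delay and converts a Retry-After header, which may be either a number of seconds or an HTTP date, into a wait time. The function names and default values are assumptions for this example.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def jittered_delay(base=1.0, spread=0.5, rng=random):
    """Return a delay in seconds drawn uniformly from base-spread to base+spread."""
    return max(0.0, rng.uniform(base - spread, base + spread))


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header (seconds or HTTP date) to seconds, or None."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller fall back to its default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


print(f"next delay: {jittered_delay():.2f}s")
print(retry_after_seconds("120"))
```

In a request loop, a 429 response would lead to sleeping for retry_after_seconds(resp.headers.get("Retry-After")) when it returns a value, and for jittered_delay() otherwise.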
Storing Credentials in Scripts
Pasting proxy USER:PASS or API keys directly into .py files exposes them in every git commit and CI log. Use environment variables (os.environ), a .env file loaded via python-dotenv, or a proper secret manager. The same applies to scraping target credentials when you log into authenticated endpoints — never hardcode, always inject at runtime.
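One way to inject credentials at runtime is a small helper that reads them from the environment and fails fast when they are missing. This sketch reuses the PROXY_USER and PROXY_PASS variables and the placeholder gate.provider.com:7000 endpoint from earlier; proxy_url_from_env is a hypothetical helper name.

```python
import os
from urllib.parse import quote


def proxy_url_from_env(env=os.environ, host="gate.provider.com", port=7000):
    """Build a proxy URL from PROXY_USER and PROXY_PASS, failing fast if unset."""
    missing = [name for name in ("PROXY_USER", "PROXY_PASS") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    # Percent-encode so characters such as '@' or ':' in a password
    # cannot be mistaken for URL delimiters.
    user = quote(env["PROXY_USER"], safe="")
    password = quote(env["PROXY_PASS"], safe="")
    return f"http://{user}:{password}@{host}:{port}"


# Demonstration with a fake environment; real code would call proxy_url_from_env().
print(proxy_url_from_env({"PROXY_USER": "alice", "PROXY_PASS": "p@ss:1"}))
```

Raising a clear error at startup is usually preferable to a KeyError halfway through a run, and the message names the variable without ever printing its value.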
Not Handling JavaScript-Rendered Content
Modern sites serve JavaScript shells that fetch data via XHR after page load. Running requests against these targets returns near-empty HTML and most scrapers misdiagnose the result as anti-bot blocking. Open the target site in your browser with DevTools and inspect the Network tab — if the data appears via XHR, switch to Playwright or call the XHR endpoint directly. Both are dramatically faster than guessing.
Hardcoding Selectors Without a Fallback
Target sites rewrite their HTML constantly, and a single brittle selector like div.product-name > span.title breaks the entire scraper the day they ship a redesign. Build resilient parsers by trying multiple selector strategies in sequence (CSS first, then XPath, then text-based heuristics), and treat any single-row parse failure as a soft warning rather than a hard exit. Better still, structure selectors as a config file separate from the scraping logic, so updating a broken selector is a one-line change instead of a code deploy.
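The fallback idea can be sketched with selectors held in a plain dictionary, which could just as easily be loaded from a JSON or YAML file. The code relies only on select_one and get_text, so a BeautifulSoup element can be passed in as node. It covers the CSS stage only, and the selector strings and function names are hypothetical examples.

```python
import logging

logger = logging.getLogger("scraper.parse")

# Selector config kept apart from the parsing logic; values are hypothetical examples.
SELECTORS = {
    "title": ["div.product-name > span.title", "h1.product-title", "[itemprop=name]"],
    "price": [".price_color", "[itemprop=price]"],
}


def first_text(node, selectors):
    """Return the text of the first selector that matches, or None if none do."""
    for selector in selectors:
        found = node.select_one(selector)
        if found is not None:
            return found.get_text(strip=True)
    return None


def parse_row(node, config=SELECTORS):
    """Extract every configured field; a missing field is logged, not raised."""
    row = {}
    for field, selectors in config.items():
        row[field] = first_text(node, selectors)
        if row[field] is None:
            logger.warning("no selector matched for field %r", field)
    return row
```

Calling parse_row(soup) on a parsed product page returns a dictionary in which unmatched fields are None, so a redesign shows up as warnings in the log and not as a crashed run.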
Not Tracking Per-Domain Block Rates
Production scrapers hitting dozens of domains routinely have one or two targets quietly failing at an 80% block rate while the rest succeed normally. Aggregate metrics hide this because the overall success rate still looks fine. Always log the success rate per domain and alert when any single domain drops below 90%. The fix is usually a different proxy provider or a sticky-session strategy for that target, but you cannot fix what you cannot see.
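A per-domain tracker needs little more than a counter keyed by hostname. The sketch below is one possible shape, assuming that any 2xx status counts as success. DomainStats, the 90% threshold default, and the minimum of 20 requests before flagging are illustrative choices.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class DomainStats:
    """Count successes and failures per hostname."""

    def __init__(self):
        self._counts = defaultdict(lambda: {"ok": 0, "failed": 0})

    def record(self, url, status_code):
        host = urlsplit(url).hostname or "unknown"
        # Assumption: any 2xx is a success; adjust if a target serves 200 block pages.
        key = "ok" if 200 <= status_code < 300 else "failed"
        self._counts[host][key] += 1

    def success_rate(self, host):
        """Return the success ratio for host, or None if it was never seen."""
        counts = self._counts.get(host)
        if not counts:
            return None
        total = counts["ok"] + counts["failed"]
        return counts["ok"] / total if total else None

    def below(self, threshold=0.90, min_requests=20):
        """Return hosts under the threshold once they have enough samples."""
        flagged = {}
        for host, counts in self._counts.items():
            total = counts["ok"] + counts["failed"]
            if total >= min_requests and counts["ok"] / total < threshold:
                flagged[host] = counts["ok"] / total
        return flagged
```

Calling stats.record(resp.url, resp.status_code) after every request and checking stats.below() at the end of a run, or on a timer, gives the per-target view that an aggregate success rate hides.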
Tips for Production Python Scrapers
- Use a Session object. requests.Session reuses TCP connections across requests, cutting per-request latency by 30–50% on sustained scraping. Configure retries and headers once at session creation.
- Cache responses by URL hash. Re-running a corpus build should not re-fetch identical pages. A simple diskcache or Redis wrapper around your session cuts proxy spend by 30–60% on iterative development.
- Log every request with status and duration. Pipe response status, latency, and content length into structured logs. Anomalies (sudden 403 spikes, latency increases) surface immediately in any log aggregator.
- Tag requests with job IDs. Include a job_id in your request headers or URL. When data quality dips on a downstream eval, trace which job produced which rows in seconds.
- Validate output with Pydantic. Parse scraped fields into a Pydantic model and reject malformed rows. Bad data caught at scrape time beats bad data caught in production a week later.
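The caching tip can be sketched with the standard library alone. The version below stores each response body in a file named after the SHA-256 hash of its URL and calls the supplied fetch function only on a miss. It is a simplified stand-in for diskcache or Redis, with no expiry or locking, and DiskCache and get_or_fetch are hypothetical names.

```python
import hashlib
from pathlib import Path


class DiskCache:
    """Store response bodies on disk, keyed by the SHA-256 hash of the URL."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get_or_fetch(self, url, fetch):
        """Return the cached body for url, calling fetch(url) only on a miss."""
        path = self._path(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        body = fetch(url)
        path.write_text(body, encoding="utf-8")
        return body
```

A typical call would be cache.get_or_fetch(url, lambda u: session.get(u, timeout=15).text), so repeated development runs parse the same pages without spending proxy bandwidth again.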
Conclusion: Ship a Working Scraper, Then Scale It
The fastest path to a working Python scraper in 2026 is the four-step pattern above — requests for fetching, BeautifulSoup for parsing, pagination with a polite delay, and retries via a configured session. That stack handles a meaningful share of real-world targets without touching Playwright or Scrapy. Add a residential proxy when you outgrow your local IP, switch to Playwright when JavaScript rendering blocks you, and graduate to Scrapy when production scale demands it.
For most readers, a quality residential proxy from BrightData, Oxylabs, Decodo, or Geonode covers the realistic threat model. Pair it with the production tips above (Session reuse, response caching, structured logging, Pydantic validation) and you have a scraper that runs for months without intervention.
Ready to dive deeper? Browse our full proxy directory for side-by-side comparisons, or read our companion guide on scraping e-commerce sites without bans for the next layer of the stack.