Web Scraping with Python in 2026: Step-by-Step Tutorial

Practical step-by-step Python web scraping tutorial for 2026 — requests, BeautifulSoup, Playwright, Scrapy, proxies, retries, and production tips.

Lokesh Kapoor
May 27, 2026

Python powers an estimated 70% of production web scraping pipelines in 2026, from indie developers pulling competitor prices to enterprise teams building LLM training corpora. The reason is simple: a working scraper in Python is roughly 10 lines of code, the ecosystem is mature, and every modern scraping concept (proxy rotation, JS rendering, retries, async) has first-class support in widely used third-party libraries.

The trouble is that the path from "10 lines of code" to "production scraper that does not break weekly" runs through a dozen subtle decisions — which library, how to parse, where to retry, how to handle pagination, when to switch to a real browser, where to add proxies. Get any one of them wrong and the pipeline silently returns empty data for a week before anyone notices.

This guide is the practical step-by-step Python web scraping tutorial for 2026 — real code samples for requests, BeautifulSoup, Playwright, and Scrapy, plus how to add proxies, retries, and rate limiting before shipping. For broader context, see our companion guides on what proxies are and scaling web scraping in 2026.

What You Will Need

Three things to get started: Python 3.9+, a code editor (VS Code or PyCharm both work), and either a virtual environment or a fresh project directory. The libraries used in this tutorial install with a single command — no Docker, no system-level dependencies, no API keys for the basic examples.

pip install requests beautifulsoup4 httpx playwright scrapy
playwright install chromium  # downloads the Chromium browser binary that Playwright drives

For the proxy examples later in the post, you will need credentials from any residential proxy provider — most ship a free trial or free tier that covers the tutorial volume. We use environment variables (PROXY_USER, PROXY_PASS) so credentials never land in your script directly.

The Python Web Scraping Stack in 2026

Five libraries cover roughly 95% of Python scraping workloads. Match the tool to your use case before writing the first line of code — picking the wrong library at the start adds rewrite cost later.

| Library | Best For | Strength | Weakness |
|---|---|---|---|
| requests | Simple HTTP, prototypes | Universal support, easy | Synchronous only |
| httpx | Async + sync HTTP | HTTP/2, async, requests-compatible API | Slightly heavier dependency |
| BeautifulSoup | HTML parsing | Clean CSS-like selectors | Just a parser, no fetching |
| Playwright | JS-heavy sites, SPAs | Real browser, full JS execution | Slower, higher resource cost |
| Scrapy | Production-scale pipelines | Built-in async, retries, pipelines | Steeper learning curve |

Step-by-Step Tutorial — Build a Real Scraper

The four steps below take a brand-new project from "empty file" to a working scraper with pagination, retries, and rate limiting. Each step builds directly on the previous one — copy, run, then layer on the next piece.

Step 1 — Make Your First Request

Use a real public scraping practice site (books.toscrape.com) so you can run this code without modification. The pattern is the same for any target.

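import requests
from bs4 import BeautifulSoup

url = "https://books.toscrape.com/catalogue/page-1.html"
# Send an explicit User-Agent instead of the default python-requests one.
headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper/1.0)"}

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page

# Pass raw bytes so BeautifulSoup can read the charset from the page itself;
# resp.text can mis-decode the "£" sign when the server omits a charset header.
soup = BeautifulSoup(resp.content, "html.parser")

# Each book on the listing page is an <article class="product_pod">.
for book in soup.select("article.product_pod"):
    title = book.h3.a["title"]  # the full title lives in the link's title attribute
    price = book.select_one(".price_color").get_text(strip=True)
    print(f"{title} - {price}")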

Step 2 — Parse Structured Data With BeautifulSoup

The same selector pattern works for any HTML structure. Use a list comprehension to build dictionaries you can pipe straight into pandas, a database, or JSON.

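# Build one dictionary per book; `soup` is the parsed page from Step 1.
books = [
    {
        "title": book.h3.a["title"],
        "price": book.select_one(".price_color").get_text(strip=True),
        # The class list looks like ["star-rating", "Three"]; index 1 is the rating word.
        "rating": book.select_one("p.star-rating")["class"][1],
        # The href is relative to the listing page; resolve it before following it.
        "url": book.h3.a["href"],
    }
    for book in soup.select("article.product_pod")
]
print(f"Scraped {len(books)} books")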
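The dictionaries above hold raw strings, so it usually helps to normalize them before loading them into pandas or a database. The following is a minimal sketch using only the standard library. The helper name normalize_book, the rating-word mapping, and the sample row are assumptions for illustration, based on the star-rating class names the practice site appears to use.

```python
import re
from decimal import Decimal
from urllib.parse import urljoin

# Assumed mapping from the site's star-rating class names to integers.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def normalize_book(raw, page_url):
    """Convert one raw scraped dict into typed values and an absolute URL."""
    # Keep only digits and the decimal point, dropping the currency symbol
    # and any mis-decoded characters in front of it.
    amount = re.sub(r"[^0-9.]", "", raw["price"])
    return {
        "title": raw["title"].strip(),
        "price": Decimal(amount),
        "rating": RATING_WORDS.get(raw["rating"]),  # None for unknown words
        "url": urljoin(page_url, raw["url"]),  # resolve against the listing page
    }


# Sample input shaped like one entry of the `books` list.
example = normalize_book(
    {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "url": "a-light-in-the-attic_1000/index.html",
    },
    "https://books.toscrape.com/catalogue/page-1.html",
)
print(example)
```

Using Decimal rather than float avoids rounding surprises if the prices are summed or compared later.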

Step 3 — Handle Pagination

Most real targets paginate. Loop with a polite delay between pages and break gracefully when the server returns a non-200 status. time.sleep(1) is the simplest rate limiter — sophisticated alternatives come in Step 4.

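import time

all_books = []
for page in range(1, 51):  # upper bound is a safety cap; the loop stops early on a non-200
    url = f"https://books.toscrape.com/catalogue/page-{page}.html"
    resp = requests.get(url, headers=headers, timeout=10)
    if resp.status_code != 200:
        print(f"Stopping at page {page} (status {resp.status_code})")
        break
    soup = BeautifulSoup(resp.content, "html.parser")
    # Store plain dicts rather than Tag objects so each parsed page can be freed.
    all_books.extend(
        {"title": b.h3.a["title"], "price": b.select_one(".price_color").get_text(strip=True)}
        for b in soup.select("article.product_pod")
    )
    time.sleep(1)  # polite fixed delay between pages
print(f"Total books across all pages: {len(all_books)}")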
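A fixed page range only works when the page count is known in advance. An alternative is to follow the pager's "next" link until it disappears. The sketch below assumes the pager markup is an li.next element wrapping a link, which is what the practice site appears to use; the function names and the injected fetch callable are illustrative, not part of any library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a")  # assumed pager markup
    if link is None or not link.get("href"):
        return None
    return urljoin(current_url, link["href"])


def crawl(start_url, fetch, max_pages=100):
    """Yield (url, html) pairs by following next links.

    `fetch` is any callable that returns the HTML for a URL; `max_pages`
    and the `seen` set guard against pagination loops.
    """
    url, seen = start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = next_page_url(html, url)
```

In a real run, fetch could wrap the request and the polite delay, for example a small function that calls requests.get(url, headers=headers, timeout=10), sleeps for a second, and returns the response text.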

Step 4 — Add Retries and Rate Limiting

Production scrapers need automatic retries on transient failures (429, 5xx) and exponential backoff to avoid amplifying load during outages. A requests.Session combined with an HTTPAdapter and a urllib3 Retry policy handles both in about a dozen lines.

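import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry = Retry(
    total=3,                                     # give up after three retries
    backoff_factor=2,                            # exponential backoff between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # retry only transient statuses
    allowed_methods=["GET"],                     # never auto-retry non-idempotent calls
)
adapter = HTTPAdapter(max_retries=retry)
session = requests.Session()
session.mount("https://", adapter)
session.mount("http://", adapter)  # cover plain-HTTP targets as well
session.headers.update(headers)    # `headers` is the dict defined in Step 1

resp = session.get("https://books.toscrape.com/", timeout=15)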
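The retry policy covers failures, but it does not space out successful requests. One simple way to add that is a small throttle that enforces a minimum, slightly randomized gap between calls. The class below is a sketch under that assumption; the name Throttle and its parameters are illustrative, and the clock and sleep arguments exist only so the behaviour can be checked without real waiting.

```python
import random
import time


class Throttle:
    """Enforce a minimum, slightly randomized gap between consecutive calls."""

    def __init__(self, min_delay=1.0, jitter=0.5, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.jitter = jitter
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = 0.0  # earliest time the next request may start

    def wait(self):
        """Block until the next request is allowed, then schedule the one after."""
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        # Add random jitter so the request timing is not perfectly regular.
        self._next_allowed = now + self.min_delay + random.uniform(0, self.jitter)
```

Calling throttle.wait() immediately before each session.get(...) keeps the request rate bounded no matter how fast the parsing code runs in between.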

Scaling Up — When to Switch Libraries

The four-step pattern above handles most low-friction scraping. Three scenarios push you into more capable tools.

JavaScript-rendered sites (SPAs, React apps, lazy-loaded grids) return empty HTML to requests because the data loads via JS after the initial response. Switch to Playwright for these targets — it spins up a real Chromium instance, executes JS, and returns the fully hydrated DOM.
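As a rough illustration of that switch, the sketch below uses Playwright's synchronous API to load a page, optionally wait for an element that only exists after hydration, and return the rendered HTML. The function name render_page and its parameters are assumptions for this example; the import is placed inside the function so the module still loads on machines where Playwright is not installed.

```python
def render_page(url, wait_selector=None, timeout_ms=30_000):
    """Return the HTML of `url` after the page's JavaScript has run."""
    # Imported lazily so importing this module does not require Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=timeout_ms)
            if wait_selector:
                # Wait for an element that only appears once the data has loaded.
                page.wait_for_selector(wait_selector, timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()
```

The returned string can go straight into BeautifulSoup, so the parsing code from Step 2 stays unchanged. A call might look like render_page("https://example.com/app", wait_selector="div.product"), where both the URL and the selector are placeholders.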

High-volume async scraping (10,000+ URLs in a single run) benefits from async I/O. Switch to httpx with asyncio for 5–10× throughput on the same hardware, since most scraping time is spent waiting on network responses. A bare-bones async fetcher is under 10 lines:

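import asyncio
import httpx

async def fetch(client, url):
    # One GET per URL; the shared client reuses connections across calls.
    resp = await client.get(url, timeout=15)
    return resp.text

async def main(urls):
    async with httpx.AsyncClient() as client:
        # gather() runs all fetches concurrently and preserves input order.
        return await asyncio.gather(*[fetch(client, u) for u in urls])

# Example input: the first five listing pages of the practice site.
url_list = [f"https://books.toscrape.com/catalogue/page-{n}.html" for n in range(1, 6)]
results = asyncio.run(main(url_list))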
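The bare-bones fetcher launches every request at once, which is unkind to the target at 10,000 URLs, and a single failed request makes the whole gather raise. A common refinement is to cap concurrency with a semaphore and capture errors per URL. The sketch below is one way to do that; fetch_all, scrape, and the default limit of 20 are assumptions for illustration rather than recommended values.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrency=20):
    """Run `fetch_one(url)` for every URL with at most `max_concurrency` in flight.

    Returns a list of (url, result) pairs in input order; when a fetch
    raises, the exception object is stored as the result instead.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(url):
        async with sem:  # blocks while the limit is reached
            try:
                return url, await fetch_one(url)
            except Exception as exc:  # keep one bad URL from failing the batch
                return url, exc

    return await asyncio.gather(*(guarded(u) for u in urls))


async def scrape(urls, max_concurrency=20):
    """Fetch page bodies with httpx using the bounded gatherer above."""
    import httpx  # imported lazily; requires `pip install httpx`

    async with httpx.AsyncClient(timeout=15) as client:

        async def fetch_one(url):
            resp = await client.get(url)
            resp.raise_for_status()
            return resp.text

        return await fetch_all(urls, fetch_one, max_concurrency)
```

Running asyncio.run(scrape(url_list)) returns one pair per URL, so failed fetches can be filtered out and retried without repeating the successful ones.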

Production-scale pipelines (millions of pages, multiple workers, persistent queues) graduate to Scrapy. It ships with async I/O, automatic retries, pipelines for data validation and storage, middlewares for proxy rotation and user-agent rotation, and a CLI for scheduled runs — all the production scaffolding you would otherwise build by hand.

Adding a Proxy to Your Python Scraper

For anything past prototyping against polite targets, you need a proxy. A single static IP sending hundreds of requests per minute gets flagged within hours on most modern sites. With requests, the integration is a single proxies argument.

import os

# gate.provider.com:7000 is a placeholder; use the gateway host and port from your provider.
proxy_url = f"http://{os.environ['PROXY_USER']}:{os.environ['PROXY_PASS']}@gate.provider.com:7000"
proxies = {"http": proxy_url, "https": proxy_url}  # route both schemes through the proxy

resp = session.get(url, proxies=proxies, timeout=15)

For Playwright, pass the proxy in the launch options instead of per request. For Scrapy, configure a single proxy middleware that injects the URL into every request automatically. The same gateway and credentials work across all three setups; the only configuration choice that matters here is picking a provider that supports the protocol your tool needs.
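To make the three setups concrete, the sketch below builds one proxy definition in the shapes that requests, Playwright, and Scrapy each expect. The helper proxy_settings, the EnvProxyMiddleware class, and the gate.provider.com:7000 gateway are placeholders for illustration. The Scrapy part assumes the built-in HttpProxyMiddleware is enabled, since that is the component that reads request.meta["proxy"].

```python
import os
from urllib.parse import quote


def proxy_settings(user, password, host="gate.provider.com", port=7000):
    """Return one proxy in the shapes requests, Playwright, and Scrapy expect."""
    server = f"http://{host}:{port}"
    # Percent-encode credentials so characters like "@" or ":" cannot break the URL.
    url = f"http://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
    return {
        # requests: session.get(url, proxies=cfg["requests"])
        "requests": {"http": url, "https": url},
        # Playwright: p.chromium.launch(proxy=cfg["playwright"])
        "playwright": {"server": server, "username": user, "password": password},
        # Scrapy: merged into request.meta by the middleware below
        "scrapy_meta": {"proxy": url},
    }


class EnvProxyMiddleware:
    """Hypothetical Scrapy downloader middleware that sets a proxy on every request."""

    def process_request(self, request, spider):
        cfg = proxy_settings(os.environ["PROXY_USER"], os.environ["PROXY_PASS"])
        # Leave any proxy already chosen for this request untouched.
        request.meta.setdefault("proxy", cfg["scrapy_meta"]["proxy"])
        return None  # let Scrapy continue processing the request
```

In Scrapy, a middleware like this would be registered in DOWNLOADER_MIDDLEWARES with a priority lower than the built-in HttpProxyMiddleware so that it runs first; check the exact ordering against the Scrapy version in use.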

The four providers below ship clean Python integration (REST APIs, official SDKs, copy-paste documentation) and cover the full price/feature spectrum for Python scrapers in 2026.

1. BrightData

BrightData runs the deepest scraping stack with 72M+ residential IPs across 195 countries, official Python SDK, and a Web Unlocker API that handles JA3 spoofing and CAPTCHA bypass server-side. Drop the proxy URL into requests or wrap the Unlocker endpoint in httpx — the response comes back ready for BeautifulSoup. The default enterprise choice.

2. Oxylabs

Oxylabs ships the cleanest Python documentation on the market alongside 102M+ IPs at 99.99% uptime. The dedicated oxylabs Python SDK wraps proxy, SERP, and e-commerce scraping behind typed clients with built-in retries. Compliance audits and SOC 2 certification make it the safe pick for finance, travel, and brand-protection pipelines.

3. Decodo

Decodo is the developer-friendly value pick — 115M+ IPs at 99.99% uptime with plans from $30/month. The single-URL auth format drops into requests, httpx, Playwright, and Scrapy identically with one line of config, and sticky session control via the username parameter handles multi-step authenticated flows without extra SDK calls.

4. Geonode

Geonode is the unlimited-bandwidth champion for high-volume Python scrapers. With 30M+ residential IPs and thread-based pricing (instead of per-GB metering), multi-terabyte training-data runs become predictable rather than budget-busting. Past 500GB monthly traffic, Geonode beats per-GB providers on TCO without sacrificing IP quality.

Common Mistakes Python Scrapers Make

Ignoring User-Agent Headers

The default python-requests/2.x User-Agent is a textbook bot fingerprint that most modern targets flag instantly. Always set a realistic browser User-Agent in your session headers, and consider rotating across a small pool of 5–10 real Chrome and Firefox versions for sustained scraping. The fix is one line of code and dramatically reduces block rates against any target with basic bot detection in place.

Skipping Rate Limiting

Running an unthrottled loop against a single target site is the fastest way to get IP-banned, regardless of proxy quality. Use time.sleep() between requests (typically 0.5–2 seconds), randomize the delay slightly to avoid mechanical timing fingerprints, and respect any Retry-After headers returned by 429 responses. Politeness costs nothing and prevents the most common cause of weekend-long debugging sessions.
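The delay logic described above can be sketched in a few lines. The helper names retry_after_seconds and polite_delay are assumptions for this example. The parser handles both forms the Retry-After header can take, a number of seconds or an HTTP date, and falls back to a jittered base delay when the server gives no hint.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After value (seconds or HTTP date) into seconds, or None."""
    if not header_value:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: let the caller use its default delay
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def polite_delay(response_headers, base=1.0, jitter=0.5):
    """Return how long to sleep before the next request, in seconds."""
    server_hint = retry_after_seconds(response_headers.get("Retry-After"))
    if server_hint is not None:
        return server_hint  # the server's request takes priority
    return base + random.uniform(0, jitter)  # randomized to avoid mechanical timing
```

With requests, the call would typically be time.sleep(polite_delay(resp.headers)) after each response.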

Storing Credentials in Scripts

Pasting proxy USER:PASS or API keys directly into .py files exposes them in every git commit and CI log. Use environment variables (os.environ), a .env file loaded via python-dotenv, or a proper secret manager. The same applies to scraping target credentials when you log into authenticated endpoints — never hardcode, always inject at runtime.

Not Handling JavaScript-Rendered Content

Modern sites serve JavaScript shells that fetch data via XHR after page load. Running requests against these targets returns near-empty HTML and most scrapers misdiagnose the result as anti-bot blocking. Open the target site in your browser with DevTools and inspect the Network tab — if the data appears via XHR, switch to Playwright or call the XHR endpoint directly. Both are dramatically faster than guessing.

Hardcoding Selectors Without a Fallback

Target sites rewrite their HTML constantly, and a single brittle selector like div.product-name > span.title breaks the entire scraper the day they ship a redesign. Build resilient parsers by trying multiple selector strategies in sequence (CSS first, then XPath, then text-based heuristics), and treat any single-row parse failure as a soft warning rather than a hard exit. Better still, structure selectors as a config file separate from the scraping logic, so updating a broken selector is a one-line change instead of a code deploy.
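A minimal sketch of the fallback idea follows. It covers only the CSS stage of the strategy described above (no XPath or text heuristics), and the SELECTORS mapping with its selector strings is a hypothetical config, not taken from any real site. Fields that match no selector come back as None and are reported separately, so the caller can log a warning instead of crashing.

```python
from bs4 import BeautifulSoup

# Hypothetical selector config: ordered fallbacks per field, preferred layout first.
# In practice this mapping could be loaded from a JSON or YAML file.
SELECTORS = {
    "title": ["div.product-name > span.title", "h1.product-title", "[itemprop=name]"],
    "price": ["span.price", "[itemprop=price]"],
}


def extract_fields(html, selectors=SELECTORS):
    """Return (row, missing): extracted text per field and the fields not found."""
    soup = BeautifulSoup(html, "html.parser")
    row, missing = {}, []
    for field, candidates in selectors.items():
        for css in candidates:
            node = soup.select_one(css)
            if node is not None:
                row[field] = node.get_text(strip=True)
                break  # first matching selector wins
        else:
            # No candidate matched: record a soft failure rather than raising.
            row[field] = None
            missing.append(field)
    return row, missing
```

Logging the missing list per page also shows when a redesign has landed, because the share of rows with missing fields jumps.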

Not Tracking Per-Domain Block Rates

Production scrapers hitting dozens of domains routinely have one or two targets quietly failing at an 80% block rate while the rest succeed normally. Aggregate metrics hide this because the overall success rate still looks fine. Always log the success rate per domain and alert when any single domain drops below 90%. The fix is usually a different proxy provider or a sticky-session strategy for that target, but you cannot fix what you cannot see.
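Per-domain tracking needs very little code. The class below is an illustrative sketch: the name DomainStats, the 90% threshold default, and the minimum sample size are assumptions, and it treats any 2xx status as success, which may be too generous for sites that serve block pages with a 200.

```python
from collections import defaultdict
from urllib.parse import urlsplit


class DomainStats:
    """Track success rates per target domain and flag the ones that fall behind."""

    def __init__(self, alert_below=0.90, min_requests=20):
        self.alert_below = alert_below
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._counts = defaultdict(lambda: {"ok": 0, "total": 0})

    def record(self, url, status_code):
        """Count one response against the domain of `url`."""
        domain = urlsplit(url).hostname or "unknown"
        bucket = self._counts[domain]
        bucket["total"] += 1
        if 200 <= status_code < 300:
            bucket["ok"] += 1

    def success_rate(self, domain):
        """Return the success ratio for a domain, or None if it has no data."""
        bucket = self._counts.get(domain)
        if not bucket or bucket["total"] == 0:
            return None
        return bucket["ok"] / bucket["total"]

    def failing_domains(self):
        """Return domains with enough traffic whose success rate is below the threshold."""
        return sorted(
            domain
            for domain, b in self._counts.items()
            if b["total"] >= self.min_requests and b["ok"] / b["total"] < self.alert_below
        )
```

Calling stats.record(resp.url, resp.status_code) after every request and checking stats.failing_domains() at the end of a run is enough to surface a single quietly failing target.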

Tips for Production Python Scrapers

  • Use a Session object. requests.Session reuses TCP connections across requests, cutting per-request latency by 30–50% on sustained scraping. Configure retries and headers once at session creation.
  • Cache responses by URL hash. Re-running a corpus build should not re-fetch identical pages. A simple diskcache or Redis wrapper around your session cuts proxy spend by 30–60% on iterative development.
  • Log every request with status and duration. Pipe response status, latency, and content length into structured logs. Anomalies (sudden 403 spikes, latency increases) surface immediately in any log aggregator.
  • Tag requests with job IDs. Include a job_id in your request headers or URL. When data quality dips on a downstream eval, trace which job produced which rows in seconds.
  • Validate output with Pydantic. Parse scraped fields into a Pydantic model and reject malformed rows. Bad data caught at scrape time beats bad data caught in production a week later.
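As an example of the caching tip, here is a minimal disk cache keyed by a hash of the URL, written with the standard library only. The class name DiskCache and the one-file-per-page layout are assumptions for illustration; it has no expiry, so it suits iterative development against pages that do not change between runs.

```python
import hashlib
from pathlib import Path


class DiskCache:
    """Store fetched page bodies on disk, keyed by the SHA-256 of the URL."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def _path(self, url):
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.html"

    def get_or_fetch(self, url, fetch):
        """Return the cached body for `url`, calling `fetch(url)` only on a miss."""
        path = self._path(url)
        if path.exists():
            return path.read_text(encoding="utf-8")
        body = fetch(url)
        path.write_text(body, encoding="utf-8")
        return body
```

A fetch callable that wraps the configured session, such as a function returning session.get(url, timeout=15).text, plugs in without further changes.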

Frequently Asked Questions

What is the best Python library for web scraping?
It depends on the workload. For simple HTTP scraping and prototypes, requests plus BeautifulSoup is the easiest combination: about 10 lines of code and you have a working scraper. For async at scale, httpx with asyncio delivers 5–10× throughput. For JavaScript-rendered sites, Playwright is the right choice. For production pipelines handling millions of pages, Scrapy ships with retries, queues, and pipelines built in. Most production stacks use multiple libraries together.

Do I need a proxy for Python web scraping?
For anything past polite, low-volume prototyping, yes. A single static IP scraping 100+ pages per minute gets flagged or blocked within hours on most modern sites. Residential proxies (BrightData, Decodo, Oxylabs) maintain success rates of 90–98% against tough targets versus 30–60% for raw static IPs. The cost, typically $1–$5 per 1,000 successful requests, is dramatically lower than the engineering hours required to maintain a custom IP rotation layer.

How do I scrape JavaScript-rendered sites with Python?
Use Playwright (or the older Selenium). Playwright spins up a real Chromium instance, executes JavaScript fully, and returns the hydrated DOM, which is exactly what a real browser sees. The Python API is clean and supports async natively. Alternatively, inspect the target site's network traffic in browser DevTools. If the data loads via a single XHR endpoint, calling that endpoint directly is dramatically faster than running a full browser instance for each page.

Should I use BeautifulSoup or Scrapy?
They solve different problems. BeautifulSoup is an HTML parsing library: you fetch the page with requests, then parse it with BeautifulSoup. Scrapy is a full scraping framework with built-in fetching, parsing, retries, queues, async, pipelines, and middlewares. For prototypes or small projects, requests plus BeautifulSoup is simpler. For production pipelines past 100K pages per run, Scrapy's built-in infrastructure pays for itself within a week. Most teams graduate from one to the other.

How do I avoid getting blocked while scraping?
Five practices reduce block rates dramatically. First, set a realistic User-Agent header. Second, throttle requests with time.sleep() between calls. Third, use residential proxies with IP rotation. Fourth, handle 429 and 5xx responses with exponential backoff retries. Fifth, randomize timing slightly to avoid mechanical fingerprinting. For tough targets that still block you, switch to a Web Unlocker API (BrightData, Oxylabs) that handles JA3 spoofing and CAPTCHA bypass server-side.

What is the difference between requests and httpx?
httpx is a newer HTTP client with a largely requests-compatible API. Most simple calls port over with minimal changes, and httpx adds optional HTTP/2 support and native async via asyncio. For new projects that may need async later, httpx is a sensible default: familiar syntax, with async available when you need scale. requests remains the most-documented library and works fine for synchronous workloads where you do not need HTTP/2 or async.

Can I scrape websites without writing Python code?
Partially. Visual tools like Octoparse and Apify provide point-and-click scraping with optional Python integration for custom logic. Low-code platforms like n8n let you build scraping workflows via the HTTP Request node without writing Python. But for any non-trivial target (pagination, multi-step authentication, JS rendering, custom parsing), Python code remains dramatically more flexible and maintainable. Most teams use Python plus a workflow tool together rather than picking one or the other.

Is web scraping with Python legal?
Python itself is only the tool. The legality of the data collection depends on what you scrape, where the target server lives, and how you use the data. In the US, the Ninth Circuit held in hiQ v. LinkedIn that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act, although contract and privacy claims can still apply. Always respect target site terms of service, avoid scraping personal data without legal basis, and consult counsel for use cases in regulated industries like finance, healthcare, or insurance.

How fast can a Python scraper run?
The figures here are rough and depend heavily on the workload. With sync requests, a single thread typically handles 10–50 pages per second depending on target latency. Adding multithreading or async with httpx scales to 500–2,000 pages per second per worker. Scrapy in its standard configuration handles 1,000–5,000 pages per second on a single machine, and its concurrency settings tune the throughput further. The real-world rate is usually capped by target-site rate limits or proxy concurrency budget, not Python itself.

Conclusion: Ship a Working Scraper, Then Scale It

The fastest path to a working Python scraper in 2026 is the four-step pattern above — requests for fetching, BeautifulSoup for parsing, pagination with a polite delay, and retries via a configured session. That stack handles a meaningful share of real-world targets without touching Playwright or Scrapy. Add a residential proxy when you outgrow your local IP, switch to Playwright when JavaScript rendering blocks you, and graduate to Scrapy when production scale demands it.

For most readers, a quality residential proxy from BrightData, Oxylabs, Decodo, or Geonode covers the realistic threat model. Pair it with the production tips above (Session reuse, response caching, structured logging, Pydantic validation) and you have a scraper that runs for months without intervention.

Ready to dive deeper? Browse our full proxy directory for side-by-side comparisons, or read our companion guide on scraping e-commerce sites without bans for the next layer of the stack.