
Web scraping is the automated extraction of structured data from websites. A scraper sends HTTP requests to a target URL, parses the HTML or JSON response, and pulls out specific fields (prices, titles, ratings, addresses) into a database, spreadsheet, or downstream pipeline. It is how price monitors, search engines, and AI training-data pipelines collect information from the open web without a human copying and pasting.
Quick facts
| Fact | Details |
|---|---|
| Also known as | Web harvesting, web data extraction, screen scraping |
| Common languages | Python, JavaScript/Node, Go |
| Primary use cases | Price monitoring, lead generation, SEO research, AI training data |
| Common blockers | Rate limiting, CAPTCHAs, IP bans, JS-rendered content |
Code example
```python
import requests
from bs4 import BeautifulSoup

# Fetch the page; the timeout keeps a stalled server from hanging the script
resp = requests.get('https://example.com/products', timeout=10)
resp.raise_for_status()

# Parse the HTML and pull structured data out of it
soup = BeautifulSoup(resp.text, 'html.parser')
for card in soup.select('.product-card'):
    title = card.select_one('.title')
    price = card.select_one('.price')
    if title is None or price is None:
        continue  # skip cards that are missing either field
    print(title.get_text(strip=True), price.get_text(strip=True))
```
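The example above prints each product, while the definition at the top of the page describes landing the fields in a spreadsheet or database. A minimal sketch of that last step follows, assuming the scraped rows have already been collected as dictionaries. The `rows_to_csv` helper, the `name` and `price` keys, and the sample values are all invented for illustration.

```python
import csv
import io

def rows_to_csv(rows, fieldnames=("name", "price")):
    """Serialize scraped rows (a list of dicts) to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Invented sample rows standing in for scraped output
sample_rows = [
    {"name": "Widget", "price": "$9.99"},
    {"name": "Gadget, large", "price": "$24.50"},
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Using the `csv` module rather than joining strings by hand means values that contain commas or quotes are escaped correctly.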
Frequently asked questions
Is web scraping legal?
Scraping publicly accessible data without bypassing authentication is legal in most jurisdictions. Scraping personal data, copyrighted content, or sites that explicitly forbid it in enforceable terms is where legal risk concentrates. Treat robots.txt as the floor, not the ceiling, of what to consider.
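As a rough illustration of the robots.txt baseline mentioned above, the standard library's `urllib.robotparser` can answer whether a given path is allowed for a given user agent. This is only a sketch: the robots.txt content and the `example-scraper` user-agent string are invented, and a real scraper would fetch the file from the site root. Passing this check says nothing about terms of service, copyright, or personal-data rules.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content; a real scraper would download it from the site
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
USER_AGENT = "example-scraper"  # hypothetical user-agent string

print(parser.can_fetch(USER_AGENT, "https://example.com/products"))        # True
print(parser.can_fetch(USER_AGENT, "https://example.com/private/report"))  # False
print(parser.crawl_delay(USER_AGENT))                                      # 5
```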
What's the difference between web scraping and an API?
An official API is a contract: the site exposes specific endpoints in a documented format. Scraping reads the same data from the HTML the site renders for human users. APIs are more stable and polite, but most sites don't expose one, so scraping fills the gap.
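To make the contrast concrete, here is a small sketch that reads the same product from two invented payloads: a JSON body of the kind an API might return, and the HTML a page might render. The payloads, field names, and CSS classes are assumptions for illustration, and the standard library's `html.parser` is used here only to keep the example dependency-free.

```python
import json
from html.parser import HTMLParser

# Invented payloads standing in for an API response and a rendered page
API_BODY = '{"products": [{"name": "Widget", "price": "9.99"}]}'
HTML_BODY = """
<div class="product-card">
  <span class="title">Widget</span>
  <span class="price">9.99</span>
</div>
"""

def from_api(body):
    """API path: the field names are part of a documented contract."""
    return [(p["name"], p["price"]) for p in json.loads(body)["products"]]

class CardParser(HTMLParser):
    """Scraping path: fields are located by CSS class names, which the
    site can rename without notice."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None   # class of the span currently open, if any
        self._current = {}   # fields collected for the card being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("title", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "div" and self._current:
            self.products.append(
                (self._current.get("title"), self._current.get("price"))
            )
            self._current = {}

def from_html(body):
    parser = CardParser()
    parser.feed(body)
    return parser.products

print(from_api(API_BODY))
print(from_html(HTML_BODY))
```

Both calls yield the same record, but the HTML path takes far more code and depends on markup details the site never promised to keep stable.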
Do I need to know how to code to scrape websites?
For one-off jobs, no-code tools like Octoparse or browser extensions can work. For anything recurring, dynamic, or at scale, you'll need Python or JavaScript — and most production scraping is written in code.
What blocks most scrapers?
In order: IP-based rate limiting, CAPTCHAs and bot challenges (especially Cloudflare and DataDome), browser fingerprinting, and layout changes that break the parser. The first three are infrastructure problems; the last is a maintenance problem.
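For the first blocker, rate limiting, a common courtesy is to slow down when the server answers with HTTP 429 (Too Many Requests). The sketch below shows one way to do that with exponential backoff. The `fetch_with_backoff` helper and its parameters are hypothetical, and the fetch function is injected so the retry logic can be run without a network. It does nothing for CAPTCHAs or fingerprinting.

```python
import time

def fetch_with_backoff(fetch, url, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it returns a status other than 429.

    `fetch` is any callable returning (status_code, body). The wait doubles
    after each rate-limited attempt: base_delay, 2x, 4x, and so on.
    """
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return status, body  # still rate-limited after the final attempt

# Fake fetcher for illustration: rate-limited twice, then succeeds
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
delays = []  # records the requested waits instead of actually sleeping
result = fetch_with_backoff(
    lambda url: next(responses),
    "https://example.com/products",
    sleep=delays.append,
)
print(result, delays)
```

In real use, `fetch` could wrap `requests.get` and return `(resp.status_code, resp.text)`, with the default `time.sleep` left in place.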
Last updated: 2026-05-28