The short answer
There is no law called "the web scraping law." Web scraping is automated reading of web pages, and reading public information is not illegal. What can be illegal is a specific combination of facts around a scrape. The four questions that actually decide legality are:
- Is the data public, or behind a login / paywall you had to break through?
- Does it contain personal data about identifiable people?
- Is the content copyrighted, and are you republishing it?
- Did you agree to Terms that prohibit scraping, and did your scraper harm the site?
Get those right and most scraping of public, non-personal data sits comfortably on the legal side. Get them wrong and even "just reading a page" can turn into a contract, privacy, or copyright problem.
United States: the CFAA and "public" data
The headline US statute people worry about is the Computer Fraud and Abuse Act (CFAA), which criminalizes accessing a computer "without authorization." The key question has been whether scraping a public website counts.
In the landmark hiQ Labs v. LinkedIn case, the Ninth Circuit held that scraping data a site makes publicly available (no login required) likely does not violate the CFAA: when the data is open to everyone, there is no authorization requirement to breach. The Supreme Court's Van Buren v. United States decision narrowed the CFAA in a compatible direction, focusing it on whether someone entered a part of a system that was off-limits to them, rather than on whether they violated a usage policy.
The practical takeaway: public means public. Data behind a password, paywall, or technical access control is a different story, because breaking through an access gate is where CFAA exposure starts. Note that hiQ ultimately lost on contract grounds: a district court later found it had breached LinkedIn's User Agreement. That is exactly why the "how" matters as much as the "what."
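One way to build "public means public" into a scraper is to treat any sign of an access gate as a stop signal rather than an obstacle to work around. The Python sketch below is a rough heuristic, not a legal test. The function name, the status-code set, and the login path hints are all assumptions chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical heuristics: status codes and URL path fragments that suggest
# the page sits behind a login, paywall, or other access control.
GATED_STATUS_CODES = {401, 402, 403, 407}
LOGIN_PATH_HINTS = ("/login", "/signin", "/sign-in", "/auth", "/subscribe")


def looks_access_gated(status_code: int, requested_url: str, final_url: str) -> bool:
    """Return True if the response suggests an access control is in the way."""
    if status_code in GATED_STATUS_CODES:
        return True
    requested_path = urlparse(requested_url).path.lower()
    final_path = urlparse(final_url).path.lower()
    # A redirect that lands on a login-style path the caller did not ask for.
    if final_path != requested_path:
        return any(hint in final_path for hint in LOGIN_PATH_HINTS)
    return False
```

A scraper could call this on every response and skip the URL when it returns True, instead of retrying with credentials or other workarounds. A 403 can also just mean the site is blocking bots, which is an equally good reason to stop.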
Personal data: GDPR and CCPA
The fact that personal data is visible on a public page does not make it free to collect and store. Under the EU/UK GDPR, processing personal data (names, emails, profiles) requires a lawful basis, and data subjects have rights regardless of where the data was found. California's CCPA, as amended by the CPRA, imposes comparable obligations on the businesses it covers.
- Aggregating public personal data at scale is one of the most litigated and regulated areas of scraping.
- Non-personal data — prices, product specs, sports scores, public filings — carries far less privacy risk.
If your scrape can avoid personal data entirely, it sidesteps the single largest category of legal risk. When you do need it, document a lawful basis and minimize what you keep.
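Minimizing what you keep can be enforced in code rather than left to good intentions. The sketch below assumes a scraper that produces one dictionary per scraped item. The field names in `ALLOWED_FIELDS` and the email pattern are hypothetical, and a regex will not catch every kind of personal data.

```python
import re

# Hypothetical field names; adapt to whatever your scraper actually extracts.
ALLOWED_FIELDS = {"product_name", "price", "currency", "in_stock"}

# Simple pattern for email-like strings; an assumption, not a complete detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def minimize_record(record: dict) -> dict:
    """Keep only allow-listed fields and redact email-like strings in text values."""
    cleaned = {}
    for key, value in record.items():
        if key not in ALLOWED_FIELDS:
            continue  # drop anything not explicitly needed
        if isinstance(value, str):
            value = EMAIL_RE.sub("[redacted]", value)
        cleaned[key] = value
    return cleaned
```

An allow-list is used here instead of a block-list so that a new field appearing on the page is dropped by default rather than stored by accident.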
Copyright and database rights
Scraping facts (a price, a temperature, a stock level) is generally safe, because facts aren't copyrightable. Copying creative expression (articles, photos, reviews, descriptions) and republishing it can infringe copyright even if you scraped it from a public page. The EU additionally recognizes a sui generis database right, which protects database makers against extraction of a substantial part of their database.
Using scraped content for analysis, indexing, or internal research tends to be lower risk than republishing it verbatim in competition with the source. When in doubt, store and transform the data rather than mirroring the original work.
Terms of Service, robots.txt and server load
Even when statutes don't bite, a site's Terms of Service can. If you clicked "I agree" or accessed an area gated by terms that ban automated collection, scraping may be a breach of contract — the ground hiQ actually lost on. Anonymous access to a fully public page is a weaker basis for a ToS claim, but the safest path is simply to respect the rules you're on notice of.
Two technical courtesies also reduce both legal and practical risk:
- Honor robots.txt and published crawl guidance.
- Rate-limit yourself. A scraper that degrades a site's service can move you from "reading public data" toward trespass-to-chattels or computer-misuse territory. Polite, paced requests matter both legally and practically: they keep you from hitting HTTP 429 (Too Many Requests) responses or getting blocked.
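Both courtesies are straightforward to automate. The sketch below uses Python's standard `urllib.robotparser` to check rules from robots.txt content you have already fetched, plus a small pacing helper. The `USER_AGENT` value, the `Pacer` class, and the one-second default are assumptions for illustration, not recommended values for any particular site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical identity; a real one would name your project and a contact.
USER_AGENT = "example-research-bot"


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def request_interval(parser: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay when it is larger than our own default."""
    declared = parser.crawl_delay(USER_AGENT)
    return max(default, float(declared)) if declared is not None else default


class Pacer:
    """Enforce a minimum gap between consecutive requests."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

In a crawl loop, you would call `parser.can_fetch(USER_AGENT, url)` before each request, skip disallowed URLs, and call `pacer.wait()` before each allowed one. The clock and sleep functions are injectable only so the pacing logic can be tested without real delays.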
A practical checklist for staying on the right side
None of this is legal advice, but these habits keep most scraping projects defensible:
- Scrape public data — don't break through logins, paywalls, or access controls.
- Avoid or minimize personal data; if you must collect it, have a lawful basis and a retention limit.
- Use facts, not verbatim creative content; transform rather than republish.
- Respect robots.txt and Terms you're genuinely on notice of.
- Rate-limit and identify yourself where appropriate; never disrupt the target's service.
- Check the laws of your jurisdiction and the target's — and ask a lawyer for anything high-stakes.
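Some teams find it useful to turn a checklist like this into a pre-flight check that runs before a scraping job starts. The sketch below is one hypothetical way to do that. `ScrapePlan`, its field names, and the wording of each concern are all invented for illustration, and passing the check says nothing about whether a project is actually lawful.

```python
from dataclasses import dataclass


@dataclass
class ScrapePlan:
    """Hypothetical description of a planned scrape; all fields are illustrative."""
    requires_login: bool
    collects_personal_data: bool
    has_lawful_basis: bool
    republishes_verbatim: bool
    respects_robots_txt: bool
    min_seconds_between_requests: float


def review(plan: ScrapePlan) -> list[str]:
    """Return the checklist items this plan fails; an empty list means none flagged."""
    concerns = []
    if plan.requires_login:
        concerns.append("crosses a login, paywall, or access control")
    if plan.collects_personal_data and not plan.has_lawful_basis:
        concerns.append("collects personal data without a documented lawful basis")
    if plan.republishes_verbatim:
        concerns.append("republishes creative content verbatim")
    if not plan.respects_robots_txt:
        concerns.append("ignores robots.txt")
    if plan.min_seconds_between_requests <= 0:
        concerns.append("has no rate limit")
    return concerns
```

A job runner could refuse to start while `review(plan)` returns anything, which forces someone to make the decision explicitly before any data is collected.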
Tools like a managed web scraping API help with the "how" by pacing requests and managing infrastructure for publicly accessible data, but the legal responsibility for what you collect and how you use it always stays with you.