Why raw HTML is not training data
Training on raw HTML hurts model quality: boilerplate (nav bars, footers, related-articles widgets) resurfaces in fine-tuned outputs as off-topic noise, and because the same boilerplate repeats across thousands of pages, the model learns to overweight it. A training-grade scraper runs main-content extraction (readability-style heuristics or an LLM-based extractor), strips boilerplate, preserves code blocks and tables, and outputs markdown that reads as cleanly as the original article.
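To make the boilerplate-stripping step concrete, here is a minimal sketch using only Python's standard library. It is a crude stand-in for a readability-style extractor, not how any particular API does it: the `SKIP_TAGS` set and the `extract_main_text` helper are assumptions chosen for illustration, and a real extractor would also score content density and emit markdown rather than plain text.

```python
from html.parser import HTMLParser

# Assumption for this sketch: anything inside these tags is boilerplate.
# Real extractors use scoring heuristics rather than a fixed tag list.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style", "form"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any SKIP_TAGS element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the non-boilerplate text of a page, one chunk per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><body><nav><a href='/'>Home</a></nav>"
    "<article><h1>Title</h1><p>Body text.</p></article>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(page))
```

On this toy page the navigation link and footer are dropped and only the article heading and paragraph survive.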
Dedupe and quality filtering
Web crawls produce massive duplication: the same article appears on the original site, as an AMP version, on syndicated mirrors, and in archive.org copies. A good API exposes a stable content hash so your pipeline can dedupe before training. License filtering matters too: respect robots.txt and ai.txt directives, capture canonical URLs, and surface license metadata (Creative Commons vs. all-rights-reserved) so legal can audit the dataset later.
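As a rough illustration of hash-based deduplication, the sketch below computes a SHA-256 over whitespace- and case-normalized text and keeps the first record seen for each hash. The record shape (`url`, `markdown` keys) and the normalization rules are assumptions for this example; an actual API may compute its content hash differently, and exact-match hashing will not catch near-duplicates that differ by more than whitespace or case.

```python
import hashlib
import re


def content_hash(markdown: str) -> str:
    """Hash text after collapsing whitespace and lowercasing (assumed rules)."""
    normalized = re.sub(r"\s+", " ", markdown).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(records):
    """Keep the first record for each distinct content hash."""
    seen = {}
    for rec in records:
        seen.setdefault(content_hash(rec["markdown"]), rec)
    return list(seen.values())


# Hypothetical crawl output: the AMP copy differs only in whitespace.
records = [
    {"url": "https://example.com/post", "markdown": "# Post\n\nSame body."},
    {"url": "https://example.com/amp/post", "markdown": "# Post\nSame   body. "},
    {"url": "https://example.com/other", "markdown": "# Other\n\nDifferent."},
]
unique = dedupe(records)
print([r["url"] for r in unique])
```

Ordering the input so canonical URLs come first is one simple way to make sure the surviving record is the original rather than a mirror.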
Scale and idempotency
Training datasets run to millions of URLs. The API has to handle idempotent retries (same URL → same hash), dead-link tracking (do not keep re-scraping URLs that return 410 Gone), proxy rotation at scale, and backpressure when downstream pipelines stall. Throughput targets in the 1,000-10,000 URLs/minute range are achievable with a managed API; building this in-house takes months of engineering before the first useful dataset lands.
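The retry and dead-link behaviour can be sketched as a small wrapper around a fetch function. Everything named here is hypothetical: the `Scraper` class, the injected `fetch` callable returning `(status, body)`, and the choice to treat only HTTP 410 as permanently dead. Backoff delays, proxy rotation, and backpressure are deliberately left out to keep the example short.

```python
from typing import Callable, Dict, Optional, Set, Tuple

PERMANENT = {410}  # assumption: only 410 Gone marks a URL as dead
Fetch = Callable[[str], Tuple[int, str]]  # returns (status code, body)


class Scraper:
    def __init__(self, fetch: Fetch, max_attempts: int = 3):
        self.fetch = fetch
        self.max_attempts = max_attempts
        self.dead: Set[str] = set()     # URLs never to be fetched again
        self.done: Dict[str, str] = {}  # URL -> body, makes repeats idempotent

    def scrape(self, url: str) -> Optional[str]:
        if url in self.dead:
            return None
        if url in self.done:
            return self.done[url]  # repeat call: no new request
        for _ in range(self.max_attempts):
            status, body = self.fetch(url)
            if status == 200:
                self.done[url] = body
                return body
            if status in PERMANENT:
                self.dead.add(url)
                return None
            # Any other status is treated as transient and retried.
        return None


# Fake fetcher for demonstration: one gone URL, one that fails once.
calls: Dict[str, int] = {}


def fake_fetch(url: str) -> Tuple[int, str]:
    calls[url] = calls.get(url, 0) + 1
    if url == "https://example.com/gone":
        return 410, ""
    if url == "https://example.com/flaky" and calls[url] == 1:
        return 503, ""
    return 200, "content of " + url


scraper = Scraper(fake_fetch)
gone_first = scraper.scrape("https://example.com/gone")
gone_second = scraper.scrape("https://example.com/gone")
flaky_first = scraper.scrape("https://example.com/flaky")
flaky_second = scraper.scrape("https://example.com/flaky")
```

In this run the gone URL is requested once and never again, while the flaky URL succeeds on its second attempt and is served from the cache on the repeat call.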
