What the core fields are and where they live
News monitoring lives or dies on getting five fields right for every article: headline, body text, author (byline), publish date, and source. The most reliable place to read them is the structured markup the publisher already embeds. Many news sites ship a NewsArticle or Article block in JSON-LD (a <script type="application/ld+json"> tag holding Schema.org data) that exposes headline, author, datePublished, and dateModified directly. Open Graph and article meta tags add fallbacks such as article:published_time, article:modified_time, and og:site_name. Reading those is usually more accurate than scraping visible text, where a date next to a headline might be a relative 'updated 2h ago' stamp rather than the original publish time.
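As a rough illustration of reading those fields, the sketch below uses only the Python standard library to collect JSON-LD blocks and meta tags from a page, then prefers the JSON-LD values and falls back to the meta tags. The function name extract_metadata, the set of accepted @type values, and the og:title fallback are assumptions made for this example; real pages vary enough that a production extractor would need more cases.

```python
import json
from html.parser import HTMLParser

# Assumed set of Schema.org types to treat as an article.
ARTICLE_TYPES = {"NewsArticle", "Article", "ReportageNewsArticle"}


class _MetaCollector(HTMLParser):
    """Collects JSON-LD script bodies and <meta property/name=...> values."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []
        self.meta = {}
        self._in_jsonld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self._in_jsonld = True
            self._buf = []
        elif tag == "meta":
            key = a.get("property") or a.get("name")
            if key and a.get("content") is not None:
                self.meta.setdefault(key, a["content"])  # first value wins

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buf))
            self._in_jsonld = False


def _iter_nodes(data):
    """Yield every dict in a JSON-LD payload, including @graph members."""
    if isinstance(data, list):
        for item in data:
            yield from _iter_nodes(item)
    elif isinstance(data, dict):
        yield data
        yield from _iter_nodes(data.get("@graph", []))


def _is_article(node):
    t = node.get("@type")
    types = t if isinstance(t, list) else [t]
    return any(isinstance(x, str) and x in ARTICLE_TYPES for x in types)


def _author_names(author):
    """Accept a string, an object with a name, or a list of either."""
    authors = author if isinstance(author, list) else [author]
    names = []
    for a in authors:
        if isinstance(a, dict) and a.get("name"):
            names.append(a["name"])
        elif isinstance(a, str):
            names.append(a)
    return names


def extract_metadata(html):
    parser = _MetaCollector()
    parser.feed(html)
    article = {}
    for raw in parser.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD and rely on meta tags
        match = next((n for n in _iter_nodes(data) if _is_article(n)), None)
        if match:
            article = match
            break
    meta = parser.meta
    return {
        "headline": article.get("headline") or meta.get("og:title"),
        "authors": _author_names(article.get("author")),
        "published": article.get("datePublished") or meta.get("article:published_time"),
        "modified": article.get("dateModified") or meta.get("article:modified_time"),
        "source": meta.get("og:site_name"),
    }
```

Calling extract_metadata(page_html) returns a plain dict; any field the page does not expose comes back as None (or an empty list for authors), which makes gaps easy to spot before you fall back to visible-text heuristics.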
For the body, the hard part is boilerplate: navigation, related-article rails, newsletter prompts, and comment widgets surround the actual story. Readability-style extraction (the same idea behind a browser's reader view) isolates the main content. Open-source helpers like extruct and metascraper pull the structured metadata, and a managed scraping API typically layers a content-extraction model on top so one request returns headline, byline, date, and clean body together rather than raw HTML you have to parse yourself.
RSS and news sitemaps vs full scraping
Start with feeds before you reach for a full crawl. RSS and Atom feeds are XML lists that publishers update as they post, and they hand you title, link, author, and publish date with no JavaScript and almost no parsing. Many outlets also publish a news sitemap in the format Google News defines (an XML file listing recent URLs, each with a <news:publication_date>), which is one of the cleanest ways to discover what is new. Polling a feed or sitemap on a schedule and deduplicating by a hash of the normalized URL lets you forward only net-new stories downstream, which is cheap and fast.
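The poll-and-deduplicate loop can be sketched in a few lines of standard-library Python. The helper names (url_key, parse_rss, net_new) are made up for this example, the parser handles only plain RSS 2.0 items, and the normalization rule (lowercase scheme and host, drop the fragment and utm_ tracking parameters) is an assumption you would tune per source.

```python
import hashlib
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def url_key(url):
    """Hash a lightly normalized URL so tracking variants map to one key."""
    parts = urlsplit(url.strip())
    # Assumption: utm_* parameters never identify the article itself.
    query = urlencode(
        [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    )
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, query, "")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def parse_rss(feed_xml):
    """Yield title, link, and pubDate for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # an item without a link cannot be deduplicated or fetched
        yield {
            "title": (item.findtext("title") or "").strip(),
            "link": link,
            "published": (item.findtext("pubDate") or "").strip(),
        }


def net_new(items, seen):
    """Return items whose URL key is not in `seen`, recording them as seen."""
    fresh = []
    for item in items:
        key = url_key(item["link"])
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh
```

In a real monitor the seen set would live in a database or key-value store so that it survives restarts; an in-memory set is used here only to keep the sketch short.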
Feeds have real limits, though. Many carry only a summary or the first paragraph rather than the full body, some outlets do not publish a feed at all, and fields vary from one site to the next. That is where scraping earns its place: you use the feed or sitemap to discover URLs, then scrape each article page to extract the complete body and richer metadata. The honest tradeoff is that RSS wins on simplicity and politeness, while scraping wins on completeness and coverage of sites that expose little or nothing in a feed. A robust monitor uses both: feeds for discovery, scraping for depth.
Freshness, scale, and clean output for LLMs
Freshness is a polling problem. Breaking-news desks may poll high-priority sources every few minutes, while a long-tail outlet can be checked hourly; an adaptive scheduler keeps a separate interval per source so you spend requests where news actually moves. Deduplication matters at scale because a single wire story (AP, Reuters) is republished across many sites, often verbatim. A content hash on the normalized body (whitespace collapsed, case folded) collapses those identical copies before they reach storage; copies with light edits need a fuzzy fingerprint such as SimHash or MinHash instead. Conditional requests cut wasted fetches on pages that have not changed: send the stored ETag back in If-None-Match, or the stored Last-Modified value in If-Modified-Since, and an unchanged page answers with 304 Not Modified and no body.
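A minimal sketch of two of these ideas follows: an exact content hash over a normalized body, and a helper that builds conditional request headers from validators saved on the previous fetch. The function names and the normalization steps (NFKC, case folding, whitespace collapsing) are assumptions chosen for illustration; this catches verbatim republication only, not lightly edited copies.

```python
import hashlib
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def body_fingerprint(body):
    """SHA-256 of the body after Unicode, case, and whitespace normalization."""
    text = unicodedata.normalize("NFKC", body).casefold()
    text = _WHITESPACE.sub(" ", text).strip()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def collapse_duplicates(articles):
    """Keep the first article seen for each distinct body fingerprint."""
    unique = {}
    for article in articles:
        unique.setdefault(body_fingerprint(article["body"]), article)
    return list(unique.values())


def conditional_headers(etag=None, last_modified=None):
    """Build request headers from validators stored on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

Pass the result of conditional_headers to whatever HTTP client you use; when the server replies 304, skip extraction and keep the stored copy.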
If the destination is a RAG or LLM pipeline, the output format is the deliverable. Markdown that preserves headings, lists, and quotes chunks and embeds far more cleanly than raw HTML, which is why AI-focused tools like Firecrawl and Crawl4AI default to markdown output; Crawl4AI even ships a BM25 content filter that keeps only sections matching your query terms. Crawl4AI is open source and runs on your own servers, which suits teams that want full control and no per-request fees. Firecrawl is a hosted API that returns LLM-ready markdown with nothing to operate. A managed web scraping API such as Scrappey covers the in-between layer: it handles proxy rotation, JavaScript rendering, and retries, and offers a markdown flag, all in a single call, so a monitor that has to reach hundreds of differently built news sites does not need its own browser fleet.
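To show why heading-preserving markdown chunks cleanly, here is a small sketch that splits an article's markdown into one chunk per heading, ready to embed. The function name chunk_markdown and the chunk shape are assumptions for this example, it recognizes only '#'-style headings, and it is not how any of the tools above chunk internally.

```python
import re

_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
_FENCE = "`" * 3  # marker that opens or closes a fenced code block


def chunk_markdown(markdown):
    """Split markdown into {"heading", "text"} chunks at each heading line."""
    chunks = []
    heading, lines, in_fence = None, [], False

    def flush():
        # Sections with no body text are dropped along with their heading.
        text = "\n".join(lines).strip()
        if text:
            chunks.append({"heading": heading, "text": text})

    for line in markdown.splitlines():
        if line.lstrip().startswith(_FENCE):
            in_fence = not in_fence
        # Inside a code fence, a leading '#' is a comment, not a heading.
        match = None if in_fence else _HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading as metadata, so a retriever can show where in the article a passage came from; long sections would still need a secondary split by length before embedding.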
