Coverage across boards and career pages
Job data lives in three places, and a good API has to reach all of them: job boards and aggregators, server-rendered company career pages, and embedded applicant-tracking-system (ATS) widgets. Large aggregators and job boards render most listings in the browser with JavaScript, so a plain HTTP fetch returns an empty shell; you need browser rendering, meaning the API runs the page's scripts so the listings actually appear. Company career pages are split between simple server-rendered HTML and ATS widgets from Greenhouse, Lever, Workable, and Ashby, which are embedded apps that build their content client-side. Many ATS platforms expose a clean public JSON endpoint per company (for example, Greenhouse's Job Board API at boards-api.greenhouse.io/v1/boards/COMPANY/jobs, as opposed to the HTML widget served from /embed/job_board?for=COMPANY), and hitting that directly is faster and more stable than scraping the rendered page. The practical rule: prefer a structured feed when one exists, fall back to a rendered page when it does not, and pick an API that does both behind one call rather than forcing you to self-host browsers and rotate residential proxies yourself.
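The "feed first, render second" rule can be sketched as a small routing function. This is a minimal illustration, not a complete ATS detector: the function name plan_fetch, the host list, and the URL templates are assumptions made for the example, and the feed URLs should be checked against each vendor's current documentation before use.

```python
from urllib.parse import parse_qs, urlparse

# Assumed public feed templates; confirm against each vendor's docs.
FEED_TEMPLATES = {
    "greenhouse": "https://boards-api.greenhouse.io/v1/boards/{token}/jobs",
    "lever": "https://api.lever.co/v0/postings/{token}?mode=json",
}

# Assumed hostnames for hosted boards (illustrative, not exhaustive).
GREENHOUSE_HOSTS = {"boards.greenhouse.io", "job-boards.greenhouse.io"}
LEVER_HOSTS = {"jobs.lever.co"}


def plan_fetch(career_url: str) -> dict:
    """Pick a structured feed for a known ATS host, else a rendered page."""
    parsed = urlparse(career_url)
    host = parsed.netloc.lower()
    segments = [part for part in parsed.path.split("/") if part]
    ats, token = None, None

    if host in GREENHOUSE_HOSTS:
        ats = "greenhouse"
        # Embedded boards carry the company in ?for=COMPANY;
        # hosted boards use the first path segment instead.
        for_param = parse_qs(parsed.query).get("for", [None])[0]
        if for_param:
            token = for_param
        elif segments and segments[0] != "embed":
            token = segments[0]
    elif host in LEVER_HOSTS:
        ats = "lever"
        token = segments[0] if segments else None

    if ats and token:
        return {"strategy": "feed", "url": FEED_TEMPLATES[ats].format(token=token)}
    # Unknown host or no company token: fall back to browser rendering.
    return {"strategy": "render", "url": career_url}
```

A caller would send "feed" plans to a plain HTTP client and "render" plans to whatever browser-rendering path it uses, so the expensive route is taken only when no structured source is recognized.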
Pagination, infinite scroll, and field extraction
Capturing a full board means iterating every page, not just the first. Classic boards page through a URL parameter, either an offset like start=10 or a page number like page=2, so you loop until a page returns no new listings. Modern boards use infinite scroll, where more results load as you scroll down; under the hood that is almost always a background XHR/fetch call to a JSON API, and replaying that request with the next offset or cursor is far cleaner than simulating scroll in a headless browser. (See how to scrape infinite-scroll pages for the network-tab approach.) For extraction, check for JSON-LD JobPosting markup first: Google-for-Jobs eligibility pushes many sites to embed a <script type="application/ld+json"> block with title, hiringOrganization, jobLocation, baseSalary (a MonetaryAmount with a currency and a QuantitativeValue carrying minValue and maxValue), datePosted, validThrough, and jobLocationType set to TELECOMMUTE for remote roles. That structured block is more stable than CSS selectors, which tend to break on redesigns; treat DOM parsing or regex over embedded JS objects as the fallback when the markup is absent.
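Both steps can be sketched with the standard library alone. In this hedged example, iter_listings walks offset-based pages until a page adds nothing new, and extract_job_postings pulls JobPosting objects out of JSON-LD script blocks. The function names, the fetch_page callable, and the assumption that each listing dict has an "id" field are illustrative choices, not part of any particular board's API.

```python
import json
from html.parser import HTMLParser
from typing import Callable, Iterator


def iter_listings(fetch_page: Callable[[int], list], page_size: int = 10) -> Iterator[dict]:
    """Yield listings page by page; stop when a page adds no unseen IDs."""
    seen = set()
    offset = 0
    while True:
        new_count = 0
        for job in fetch_page(offset):  # fetch_page is caller-supplied (assumption)
            if job["id"] in seen:
                continue
            seen.add(job["id"])
            new_count += 1
            yield job
        if new_count == 0:
            return
        offset += page_size


class _JsonLdCollector(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._inside = True
            self._buffer = []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self.blocks.append("".join(self._buffer))
            self._inside = False


def extract_job_postings(html: str) -> list:
    """Return every JSON-LD object in the page whose @type is JobPosting."""
    collector = _JsonLdCollector()
    collector.feed(html)
    collector.close()
    postings = []
    for raw in collector.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed markup: leave it to the DOM fallback
        if isinstance(data, dict) and isinstance(data.get("@graph"), list):
            candidates = data["@graph"]
        elif isinstance(data, list):
            candidates = data
        else:
            candidates = [data]
        for item in candidates:
            if not isinstance(item, dict):
                continue
            kind = item.get("@type")
            if kind == "JobPosting" or (isinstance(kind, list) and "JobPosting" in kind):
                postings.append(item)
    return postings
```

Passing the fetcher in as a callable keeps the loop independent of transport, so the same code works whether a page comes from a URL parameter or from a replayed XHR request. An empty result from extract_job_postings is the signal to fall back to DOM parsing.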
Deduplication, normalization, and DIY vs managed
The same job often appears on several sites at once, so deduplication is the part that actually makes the dataset usable. Use the source's native job ID when it exposes one; when it does not, synthesize a stable key by hashing normalized fields (lowercased title, canonical company name, and a coarse location such as city or remote) so the same posting collapses to one record across runs and across boards. Normalize before you hash: salary strings ("$120k-150k", "120,000 - 150,000 USD") become a numeric min/max plus a currency code, free-text locations map to a city/region/remote enum, and posted dates become ISO 8601 timestamps. On the build question, a DIY stack (requests or httpx plus Playwright for rendering, BeautifulSoup for parsing) gives you full control and is cheap at small scale, but you own proxy rotation, anti-bot handling, headless-browser upkeep, and retries. A managed web-data API like Scrappey folds proxies, JavaScript rendering, session management, and retries into a single call, which is usually the better trade once you are watching more than a handful of boards on a schedule. Whichever path you take, scrape only public listings and follow each site's Terms of Service and robots directives.
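The normalize-then-hash step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the choice of SHA-256, the "source:id" format for native IDs, and defaulting to USD when no currency code is present are all choices made for the example. The salary parser handles only simple ranges like the two quoted above; a string such as "$120-150k", where one suffix applies to both numbers, would need extra handling.

```python
import hashlib
import re
import unicodedata
from typing import Optional


def normalize_text(value: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    ascii_text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_text.lower()).strip()


def parse_salary(text: str, default_currency: str = "USD") -> Optional[dict]:
    """Turn a salary string into numeric min/max plus a currency code."""
    amounts = []
    # A number with optional thousands separators, optionally followed by k/K.
    for number, suffix in re.findall(r"(\d[\d,]*(?:\.\d+)?)([kK]\b)?", text):
        value = float(number.replace(",", ""))
        amounts.append(value * 1000 if suffix else value)
    if not amounts:
        return None
    code = re.search(r"\b([A-Z]{3})\b", text)
    return {
        "min": min(amounts),
        "max": max(amounts),
        # Assumption: fall back to default_currency when no 3-letter code appears.
        "currency": code.group(1) if code else default_currency,
    }


def dedup_key(title: str, company: str, location: str,
              native_id: Optional[str] = None, source: Optional[str] = None) -> str:
    """Prefer the source's own job ID; otherwise hash the normalized fields."""
    if native_id:
        return f"{source}:{native_id}"
    basis = "|".join(normalize_text(part) for part in (title, company, location))
    return hashlib.sha256(basis.encode("utf-8")).hexdigest()
```

Because the hash is computed over normalized text, "Acme, Inc." and "ACME Inc" produce the same key, while genuinely different titles or locations do not. A production pipeline would likely add a company alias table on top, since punctuation stripping alone will not unify names like "Acme" and "Acme Corporation".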
