What real estate data lives in a listing page
The data you want is rarely in the visible HTML markup returned by a plain HTTP request; the page builds its listing view with JavaScript after it loads. Most portals are single-page apps (SPAs) built on frameworks like Next.js, so the listing details usually ride along in an embedded JSON blob inside a <script> tag, or arrive from a separate API call. Common patterns include a __NEXT_DATA__ block (Next.js sites embed the page props there), a window.PAGE_MODEL object (used by some UK portals), or an internal JSON endpoint the front-end calls (Redfin's /stingray/api/gis returns listing JSON, oddly prefixed with {}&& that you strip before parsing).
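The two extraction patterns above can be sketched in a few lines of Python. This is a minimal illustration, not a portal-specific implementation: the sample HTML, the props.pageProps.listing path, and the helper names are hypothetical, and the prefix pattern deliberately tolerates any run of ampersands rather than assuming an exact prefix.

```python
import json
import re
from typing import Optional

# Matches the <script id="__NEXT_DATA__"> tag that Next.js pages embed.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)

# Matches a leading "{}" followed by one or more "&" characters.
JSON_PREFIX_RE = re.compile(r"^\{\}&+")


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


def parse_prefixed_json(body: str) -> dict:
    """Strip a leading non-JSON prefix such as '{}&&' and parse the rest."""
    return json.loads(JSON_PREFIX_RE.sub("", body, count=1))


# Hypothetical page; the real key path differs per portal and per build.
sample_html = (
    '<html><body><div id="__next"></div>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"listing": {"price": 450000, "beds": 3}}}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
listing = data["props"]["pageProps"]["listing"]
print(listing["price"], listing["beds"])
```

A regular expression is enough here because the target is a single well-known tag; for anything more involved, an HTML parser is the safer choice.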
Once you reach that JSON you typically find the fields buyers care about: list price, bedrooms and bathrooms, living area in square feet or square meters, lot size, year built, property type, listing status (active, pending, sold), days on market, price history, latitude/longitude, agent and brokerage, photos, and the description text. Expect to normalize: prices arrive as "$450,000" or "450K" and need parsing to an integer, beds show as "3 bd" or "3", and addresses come in inconsistent formats. Add your own scraped_at, first_seen, and last_seen timestamps so you can compute days-on-market drift and price changes over time.
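A minimal sketch of that normalization step, assuming prices use only "K" and "M" suffixes and that a bed count is the first integer in the string. The function names and the record layout are illustrative, not taken from any particular portal.

```python
import re
from datetime import datetime, timezone
from typing import Optional

_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_price(raw: str) -> Optional[int]:
    """Turn strings like '$450,000' or '450K' into an integer amount."""
    text = raw.strip().lower().replace(",", "")
    match = re.search(r"(\d+(?:\.\d+)?)\s*([km])?\b", text)
    if match is None:
        return None
    number, suffix = match.groups()
    return int(round(float(number) * _MULTIPLIERS.get(suffix, 1)))


def parse_beds(raw: str) -> Optional[int]:
    """Turn strings like '3 bd' or '3' into an integer; None if no digits."""
    match = re.search(r"\d+", raw)
    return int(match.group()) if match else None


def stamp(record: dict, previous: Optional[dict] = None,
          now: Optional[datetime] = None) -> dict:
    """Add scraped_at/first_seen/last_seen, keeping first_seen from an earlier row."""
    ts = (now or datetime.now(timezone.utc)).isoformat()
    stamped = dict(record)
    stamped["scraped_at"] = ts
    stamped["first_seen"] = previous["first_seen"] if previous else ts
    stamped["last_seen"] = ts
    return stamped


row = stamp({"price": parse_price("$450,000"), "beds": parse_beds("3 bd")})
print(row)
```

Passing the previously stored row as previous on each later crawl preserves first_seen while last_seen advances, which is what makes days-on-market and price-change calculations possible.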
Why these portals are hard to scrape
Three things make property portals harder than an average website. First, JavaScript rendering: a basic HTTP client sees an empty shell, so you either run a headless browser (Playwright, Puppeteer, Selenium) or reverse-engineer the embedded JSON and internal APIs. Second, anti-bot defenses: large portals commonly sit behind systems such as Akamai Bot Manager or Imperva, which fingerprint your TLS handshake, browser, and JavaScript execution, and they block data-center IP ranges almost on sight. Third, geo-gating: prices, currency, available inventory, and even which listings appear change based on the country and sometimes the ZIP code inferred from your IP, so scraping the wrong region quietly gives you the wrong numbers.
The practical implication is that you usually need real-browser-like requests plus residential proxies (IP addresses from real home connections) in the target country, with sensible request spacing per IP. Difficulty varies a lot between portals, so benchmark each target individually rather than assuming one recipe works everywhere.
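Request spacing per IP can be sketched as a small pacer that hands out the least recently used proxy and waits if that proxy was used too recently. This is an illustration under assumptions: the class name, the proxy URLs, and the 5-second gap are hypothetical, and the right gap depends on the portal.

```python
import time
from typing import Callable, Dict, List


class ProxyPacer:
    """Pick the least recently used proxy and enforce a minimum gap per proxy."""

    def __init__(
        self,
        proxies: List[str],
        min_gap_seconds: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._gap = min_gap_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_used: Dict[str, float] = {}

    def acquire(self) -> str:
        """Return a proxy URL, sleeping first if it was used too recently."""
        never = float("-inf")
        proxy = min(self._proxies, key=lambda p: self._last_used.get(p, never))
        wait = self._last_used.get(proxy, never) + self._gap - self._clock()
        if wait > 0:
            self._sleep(wait)
        self._last_used[proxy] = self._clock()
        return proxy


# Hypothetical residential proxy endpoints in the target country.
pacer = ProxyPacer(
    ["http://user:pass@proxy-1.example.invalid:8000",
     "http://user:pass@proxy-2.example.invalid:8000"],
    min_gap_seconds=5.0,
)
proxy_url = pacer.acquire()  # pass this to your HTTP client's proxy setting
```

The clock and sleep functions are injectable so the pacing logic can be tested without real delays.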
DIY tooling vs a managed API
If you target one or two lighter portals and can tolerate some maintenance, a DIY stack is often the most economical and gives you full control. A typical setup pairs a headless browser or an HTTP client such as curl_cffi or httpx with a residential proxy provider (Bright Data, Oxylabs, Smartproxy) and your own retry and parsing logic; Scrapy handles crawl orchestration well, and Playwright drives the browser when a page needs rendering. The cost is that you own the proxy rotation, fingerprint upkeep, challenge handling, and the 3 a.m. fix when a portal ships a new Next.js build and your JSON path breaks.
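The "own retry logic" part of a DIY stack might look like the sketch below: exponential backoff with jitter around whatever fetch function you use. The set of retryable status codes, the delay values, and the fetch callable's (status, body) return shape are assumptions for illustration.

```python
import random
import time
from typing import Callable, Tuple

# Statuses treated as transient or block-related; tune per portal.
RETRYABLE = {403, 429, 500, 502, 503, 504}


def fetch_with_retries(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch(url) until it returns a non-retryable status or attempts run out."""
    status, body = 0, ""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts:
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    return status, body


# Stand-in for a real HTTP call: fails twice, then succeeds.
_responses = iter([(429, ""), (503, ""), (200, "<html>ok</html>")])


def fake_fetch(url: str) -> Tuple[int, str]:
    return next(_responses)


result = fetch_with_retries(fake_fetch, "https://portal.example.invalid/listing/1",
                            sleep=lambda seconds: None)
print(result[0])
```

In practice the fetch callable would wrap your HTTP client or browser session, ideally switching proxy between attempts.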
A managed scraping API wins when you have continuous, multi-portal needs across regions, or when your targets redesign often and you want the pipeline to keep running without babysitting. Managed services such as Scrapfly, ZenRows, Bright Data, and Scrappey roll JavaScript rendering, residential geo-targeting, anti-bot handling, and retries into a single call, so you spend your time on what to collect rather than how to keep collectors alive. The trade-off is less low-level control and a per-request cost, and many APIs still hand back raw HTML you must parse yourself, so weigh output format alongside success rate. Pick based on scope, budget, and how much infrastructure you want to own.
