Understanding Dynamic Content
"Dynamic" means the content shows up only after JavaScript runs in the browser, not in the raw HTML the server first sends. It arrives through one of these common patterns:
1. Types of Dynamic Loading
- AJAX requests (background calls that fetch data without reloading the page)
- Infinite scroll (more items load as you scroll down)
- Lazy loading (content loads only when it scrolls into view)
- WebSocket updates (a live connection that streams new data)
- React/Vue.js state changes (the framework re-renders the page in place)
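A quick way to tell whether any of these patterns applies is to look for the target element in the HTML exactly as the server sent it. The sketch below uses only the standard library; the function name is_in_static_html and the two sample pages are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any tag in the parsed HTML carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        for name, value in attrs:
            if name == "class" and value and self.css_class in value.split():
                self.found = True


def is_in_static_html(raw_html, css_class):
    """Return True if the class appears in the HTML before any JavaScript runs."""
    finder = ClassFinder(css_class)
    finder.feed(raw_html)
    return finder.found


# Hypothetical server responses, for illustration only
static_page = '<ul><li class="content-item">First</li></ul>'
dynamic_page = '<div id="root"></div><script src="/app.js"></script>'

print(is_in_static_html(static_page, "content-item"))   # True
print(is_in_static_html(dynamic_page, "content-item"))  # False
```

If the check comes back False for an element you can see in the browser, the content is most likely dynamic and a plain HTTP fetch will not be enough.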
Solution Approaches
The reliable fix is to run a real browser that executes the JavaScript, then read the page once it has rendered. Below are three approaches.
1. Using Selenium
Selenium drives a real Chrome browser. Here it scrolls to the bottom repeatedly until the page stops growing, which is how you exhaust an infinite-scroll feed before reading the items.
import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


class DynamicScraper:
    def __init__(self):
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 10)

    def scrape_infinite_scroll(self, url, scroll_pause=2):
        self.driver.get(url)
        last_height = self.driver.execute_script('return document.body.scrollHeight')
        while True:
            # Scroll to the bottom to trigger the next batch
            self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
            # Wait for new content
            time.sleep(scroll_pause)
            # Calculate new scroll height
            new_height = self.driver.execute_script('return document.body.scrollHeight')
            # Stop once the page has stopped growing
            if new_height == last_height:
                break
            last_height = new_height
        # Extract content after the feed is exhausted
        elements = self.driver.find_elements(By.CSS_SELECTOR, '.content-item')
        return [elem.text for elem in elements]
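On a feed that never ends, a scroll loop like this would run forever. One possible safeguard, sketched here as an assumption rather than as part of the scraper above, is to cap the number of scrolls. The helper scroll_until_stable and the FakePage class are hypothetical; the page is passed in as two callables so the logic can be tried without a browser.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=2.0,
                        max_rounds=50, sleep=time.sleep):
    """Scroll until the height stops changing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        sleep(pause)  # give the page time to load the next batch
        new_height = get_height()
        if new_height == last_height:
            return rounds  # nothing new arrived, so the feed is exhausted
        last_height = new_height
    return max_rounds  # gave up: the page was still growing


class FakePage:
    """Stand-in for a browser page whose height follows a fixed script."""

    def __init__(self, heights):
        self.heights = heights
        self.index = 0

    def height(self):
        return self.heights[self.index]

    def scroll(self):
        self.index = min(self.index + 1, len(self.heights) - 1)


# A page that grows twice and then stops
page = FakePage([1000, 2000, 3000, 3000])
print(scroll_until_stable(page.height, page.scroll, sleep=lambda s: None))  # 3
```

With Selenium, the two callables would be small lambdas wrapping the same execute_script calls used for reading scrollHeight and scrolling to the bottom.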
2. Using Playwright
Playwright is a newer browser-automation tool that is often faster than Selenium. The key trick is waiting: networkidle means "wait until network traffic settles" (no network connections for at least 500 ms) and wait_for_selector means "wait until this element actually exists" (by default, until it is also visible). Both ensure the dynamic content has arrived before you read it. A SPA (single-page app) is a site, such as a React or Vue app, that renders everything with JavaScript.
from playwright.sync_api import sync_playwright


class ModernScraper:
    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch()

    def scrape_spa(self, url):
        # The sync API is used throughout, so no async/await is needed
        page = self.browser.new_page()
        # Navigate and wait for network idle
        page.goto(url, wait_until='networkidle')
        # Wait for specific content
        page.wait_for_selector('.dynamic-content')
        # Extract data by running JavaScript inside the page
        data = page.evaluate('''
            () => {
                const items = document.querySelectorAll('.item');
                return Array.from(items).map(item => ({
                    title: item.querySelector('.title').innerText,
                    description: item.querySelector('.desc').innerText
                }));
            }
        ''')
        page.close()
        return data

    def close(self):
        # Release the browser and the Playwright driver when finished
        self.browser.close()
        self.playwright.stop()
3. Intercepting AJAX Requests
Instead of reading rendered HTML, you can capture the raw API responses the page fetches in the background. Here a proxy (mitmproxy, an HTTP proxy that sits between browser and server) watches traffic and saves any JSON coming back from a URL containing "api", which is often the cleanest source of the data.
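# Addon script; the filename interceptor.py is only an example.
# Load it with: mitmdump -s interceptor.py
import json


class AjaxInterceptor:
    def __init__(self):
        self.data = []

    def request(self, flow):
        # Add custom headers
        flow.request.headers['X-Requested-With'] = 'XMLHttpRequest'

    def response(self, flow):
        # Capture API responses
        if 'api' in flow.request.pretty_url:
            try:
                self.data.append(json.loads(flow.response.content))
            except (ValueError, TypeError):
                # Not valid JSON, not decodable, or an empty body
                pass


# mitmproxy discovers addons through this module-level list
addons = [AjaxInterceptor()]

# Usage with Selenium (a separate script; mitmproxy listens on port 8080 by default).
# HTTPS traffic is readable only if the browser trusts mitmproxy's CA certificate.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--proxy-server=localhost:8080')
driver = webdriver.Chrome(options=options)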
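A plain substring test for "api" is loose: it also matches a URL such as https://example.com/therapists, because "therapists" happens to contain those letters. A stricter filter might check for a whole path segment and a JSON content type, as in the sketch below. The function name looks_like_api_json is hypothetical, and the rule itself is an assumption about the target site; some sites serve their API from a subdomain or a path like /graphql instead.

```python
from urllib.parse import urlsplit


def looks_like_api_json(url, content_type):
    """Return True for a JSON response whose URL path has an 'api' segment."""
    # Compare whole path segments so '/therapists' does not count as an API path
    segments = urlsplit(url).path.lower().split("/")
    is_api_path = "api" in segments
    # Ignore parameters such as '; charset=utf-8'; tolerate a missing header
    media_type = (content_type or "").lower().split(";")[0].strip()
    return is_api_path and media_type == "application/json"


print(looks_like_api_json("https://example.com/api/v1/items",
                          "application/json; charset=utf-8"))  # True
print(looks_like_api_json("https://example.com/therapists",
                          "application/json"))                 # False
print(looks_like_api_json("https://example.com/api/v1/items",
                          "text/html"))                        # False
```

Inside the response hook, the two arguments would come from flow.request.pretty_url and flow.response.headers.get("content-type", "").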
Best Practices
1. Handling Loading States
The most common bug is reading the page too early. Wait for the network to go quiet, wait for the loading spinner to disappear, then confirm the real content has appeared — in that order.
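# Playwright raises its own TimeoutError, not Python's built-in one
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError


class LoadingHandler:
    def wait_for_load(self, page):
        # Wait for network idle
        page.wait_for_load_state('networkidle')
        # Check loading indicators
        try:
            page.wait_for_selector('.loading-spinner', state='hidden')
        except PlaywrightTimeoutError:
            # Spinner never went away; still check for content below
            pass
        # Ensure content is ready
        page.wait_for_selector('.content-loaded')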
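Under the hood, every one of these waits amounts to polling a condition until it holds or a deadline passes. The sketch below shows that mechanism in isolation. The helper wait_until is hypothetical and is not part of Playwright or Selenium; the clock and sleep parameters exist only so the example runs instantly.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25,
               clock=time.monotonic, sleep=time.sleep):
    """Poll predicate until it returns a truthy value or timeout seconds pass."""
    deadline = clock() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)


# A stand-in condition that becomes true on the third check
calls = {"n": 0}


def content_ready():
    calls["n"] += 1
    return calls["n"] >= 3


wait_until(content_ready, timeout=5, interval=0, sleep=lambda s: None)
print(calls["n"])  # 3
```

In real scraping code the browser library's own waits are usually the better choice, since they hook into page events instead of polling from outside.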
2. Error Recovery
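Dynamic pages are flaky, so expect failures. Return None instead of crashing when an element never shows up, and retry transient errors with exponential backoff: each retry waits twice as long as the previous one (1, 2, 4 seconds, and so on) so you do not hammer the site. With the default of three attempts, that means waits of 1 and 2 seconds before the final try.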
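import asyncio
import logging

# Playwright raises its own TimeoutError, not Python's built-in one
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

logger = logging.getLogger(__name__)


class ResilientScraper:
    def safe_extract(self, page, selector, timeout=5000):
        # timeout is in milliseconds
        try:
            element = page.wait_for_selector(selector, timeout=timeout)
            return element.text_content()
        except PlaywrightTimeoutError:
            logger.warning(f'Element {selector} not found')
            return None

    async def retry_action(self, action, max_retries=3):
        # action must be an async callable taking no arguments
        for attempt in range(max_retries):
            try:
                return await action()
            except Exception:
                # Out of attempts: let the last error propagate
                if attempt == max_retries - 1:
                    raise
                # Back off: 1 s, then 2 s, then 4 s, ...
                await asyncio.sleep(2 ** attempt)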
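The backoff schedule is easier to see in a synchronous version with the sleep function injected, so the delays can be recorded instead of actually waited. This is a sketch under stated assumptions: retry_with_backoff, base_delay, and the flaky function are hypothetical names, and the doubling rule mirrors the 2 ** attempt expression used above.

```python
import time


def retry_with_backoff(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call action, retrying on any exception with doubling delays."""
    for attempt in range(max_retries):
        try:
            return action()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay


delays = []             # records each delay instead of sleeping
attempts = {"n": 0}


def flaky():
    """Fails twice with a transient error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Catching every Exception keeps the sketch short; in practice it may be safer to retry only on the error types known to be transient.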
Remember: Dynamic content scraping requires patience and proper waiting mechanisms. Always respect the website's resources and implement appropriate delays.
