How to extract data from websites using Selenium Python? (2026 Guide)

Quick facts

What it is: Browser automation via WebDriver
Best for: JS-rendered pages & interactions
Locators: CSS selectors, XPath
Key skill: Explicit waits over sleep()
Lighter alternative: Playwright (modern API)

Quick Setup Guide

Selenium drives a real browser from your Python code: it opens pages, clicks buttons, and reads what loads — just like a person would. That makes it good for sites that build their content with JavaScript, where a plain HTTP request would only return an empty shell. The class below is a reusable starting point. headless=True runs Chrome with no visible window (faster on servers), and WebDriverWait lets the script pause until elements actually appear instead of guessing. The __enter__/__exit__ methods let you use it with Python's with statement so the browser always closes, even on errors.

import time  # used by scroll_to_bottom further down

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

class ModernSeleniumScraper:
    def __init__(self, headless=True):
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument('--disable-dev-shm-usage')
        
        self.driver = webdriver.Chrome(options=options)
        self.wait = WebDriverWait(self.driver, timeout=10)
    
    def __enter__(self):
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()
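
To make the cleanup guarantee concrete, here is a minimal sketch that swaps the real browser for a stand-in. FakeDriver and ScraperSession are hypothetical names used only for illustration; ScraperSession copies the __enter__/__exit__ shape of ModernSeleniumScraper so the behavior can be checked without launching Chrome.

```python
class FakeDriver:
    """Hypothetical stand-in for webdriver.Chrome; it only records quit()."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


class ScraperSession:
    """Same __enter__/__exit__ shape as ModernSeleniumScraper, minus Selenium."""

    def __init__(self, driver):
        self.driver = driver

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.driver.quit()
        return False  # do not swallow the exception, just clean up


def run_and_fail(driver):
    """Raise inside the with block and report whether the driver was closed."""
    try:
        with ScraperSession(driver):
            raise RuntimeError('simulated scraping failure')
    except RuntimeError:
        pass
    return driver.closed
```

With the real class the same pattern reads `with ModernSeleniumScraper() as scraper:` followed by calls on `scraper.driver` inside the block.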

Essential Features

1. Finding Elements Smartly

The biggest cause of flaky scrapers is asking for an element before the page has finished drawing it. Instead of pausing a fixed number of seconds, an explicit wait keeps checking until the element appears (or gives up after the timeout). The helper below wraps that pattern and returns None instead of crashing if the element never shows. You can locate elements by ID, by CSS selector, or by XPath — a path-like query into the page's HTML structure.

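# Method of ModernSeleniumScraper
def find_element_safely(self, by, value, timeout=10):
    try:
        # Build the wait from the timeout argument so the caller's value is actually used
        element = WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((by, value))
        )
        return element
    except TimeoutException:
        print(f'Element {value} not found within {timeout} seconds')
        return None

# Usage examples (scraper is a ModernSeleniumScraper instance that has this method):
button = scraper.find_element_safely(By.ID, 'submit-button')
heading = scraper.find_element_safely(By.CSS_SELECTOR, 'h1.title')
link = scraper.find_element_safely(By.XPATH, '//a[contains(text(), "Next")]')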
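
For intuition about what an explicit wait does, the sketch below shows the polling idea in plain Python. It is a simplified illustration, not Selenium's actual implementation: poll_until is a hypothetical helper, it returns None on timeout where WebDriverWait raises TimeoutException, and the clock and sleep parameters exist only so the loop can be exercised without real delays.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call condition() until it returns something truthy or the timeout passes.

    Returns the truthy result, or None if the deadline is reached first.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            return None
        sleep(interval)  # short pause between checks instead of one long guess
```

The point of the pattern is that the script continues as soon as the condition holds, so a fast page costs one check while a slow page gets the full timeout.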

2. Handling Dynamic Content

Many sites load data after the first paint — content appears as you scroll, or a button only becomes usable once it is fully rendered. element_to_be_clickable waits until an element is both visible and enabled. For infinite-scroll pages, the loop below keeps scrolling to the bottom and stops once the page height stops growing, meaning no new content is loading.

def wait_for_dynamic_content(self, selector, timeout=10):
    try:
        # Wait for the element to be clickable, honoring the timeout argument
        element = WebDriverWait(self.driver, timeout).until(
            EC.element_to_be_clickable((By.CSS_SELECTOR, selector))
        )
        return element
    except TimeoutException:
        print(f'Dynamic content not loaded: {selector}')
        return None

# Handle infinite scroll
def scroll_to_bottom(self):
    last_height = self.driver.execute_script('return document.body.scrollHeight')
    while True:
        self.driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(2)  # Allow content to load
        
        new_height = self.driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break
        last_height = new_height
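
Some pages never stop growing, so a scroll loop with no upper bound can run forever. The variant below is a sketch under stated assumptions: scroll_until_stable and its max_rounds cap are hypothetical additions, and the function accepts any object with an execute_script method, so it works with a Selenium driver or a test double.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50, sleep=time.sleep):
    """Scroll to the bottom until the page height stops growing.

    Stops after max_rounds scrolls even if the page keeps growing.
    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script('return document.body.scrollHeight')
    for rounds in range(1, max_rounds + 1):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        sleep(pause)  # allow newly requested content to load
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            return rounds  # height unchanged, so nothing new was loaded
        last_height = new_height
    return max_rounds
```

The cap trades completeness for a guaranteed exit; raise it for long feeds, lower it when only the first few screens matter.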

3. Real-World Example: E-commerce Scraper

Here is how the pieces fit together. This class extends the base scraper to pull product details from a page: it opens the URL, waits for the product container to load, then reads each field with a small helper. get_price shows a common cleanup step — stripping out the $ and commas so the price becomes a real number you can compare or store.

class EcommerceScraper(ModernSeleniumScraper):
    def scrape_product_page(self, url):
        try:
            self.driver.get(url)
            
            # Wait for main content
            self.wait_for_dynamic_content('.product-container')
            
            return {
                'title': self.get_text('h1.product-title'),
                'price': self.get_price('.product-price'),
                'description': self.get_text('.product-description'),
                'rating': self.get_rating('.product-rating'),
                'reviews': self.get_reviews('.review-section'),
                'url': url
            }
        except Exception as e:
            print(f'Error scraping {url}: {e}')
            return None
    
    def get_text(self, selector):
        element = self.find_element_safely(By.CSS_SELECTOR, selector)
        return element.text.strip() if element else None
    
    def get_price(self, selector):
        price_elem = self.find_element_safely(By.CSS_SELECTOR, selector)
        if price_elem:
            price_text = price_elem.text.strip().replace('$', '').replace(',', '')
            try:
                return float(price_text)
            except ValueError:
                return None
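        return None

    def get_rating(self, selector):
        # Assumption: the rating text starts with a number, e.g. "4.5 out of 5"
        text = self.get_text(selector)
        try:
            return float(text.split()[0]) if text else None
        except ValueError:
            return None

    def get_reviews(self, selector):
        # Assumption: each review is a '.review' element inside the review section
        section = self.find_element_safely(By.CSS_SELECTOR, selector)
        if section is None:
            return []
        return [r.text.strip() for r in section.find_elements(By.CSS_SELECTOR, '.review')]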
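
Price text on real pages often carries more than a dollar sign, for example a currency code or a trailing label. A slightly more tolerant parser can be sketched as below; parse_price is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, dot as decimal point), so a string such as "1.299,99" would be misread.

```python
import re


def parse_price(text):
    """Return the first number found in a price string as a float, or None."""
    if not text:
        return None
    # One digit, then any mix of digits and commas, then an optional decimal part
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if match is None:
        return None
    return float(match.group().replace(',', ''))
```

Keeping the parsing in a plain function like this makes it easy to test against sample strings without opening a browser.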

Best Practices

A few habits keep a Selenium scraper reliable and fast.

1. Error Handling

  • Always use try-except blocks
  • Implement timeouts
  • Handle stale elements
  • Log errors properly

(A "stale" element is one Selenium found earlier whose reference no longer works because the page has since reloaded or redrawn that part of the DOM. The fix is to find the element again.)
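
One way to handle stale elements is to wrap the "find, then act" step in a small retry loop. The helper below is a generic sketch: retry is a hypothetical name, and the exception type is passed in so the function itself does not depend on Selenium. In a scraper you would pass StaleElementReferenceException from selenium.common.exceptions, and the action should look the element up again on every attempt.

```python
import time


def retry(action, retry_on, attempts=3, delay=0.5, sleep=time.sleep):
    """Run action(); if it raises retry_on, pause and try again.

    Re-raises the exception once the final attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)  # give the page a moment to settle before retrying
```

For example, `retry(lambda: driver.find_element(By.ID, 'buy').click(), StaleElementReferenceException)` re-finds the button on each attempt; the 'buy' ID is a placeholder.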

2. Performance Optimization

  • Use headless mode when possible
  • Implement element caching
  • Minimize page loads
  • Clean up resources

3. Anti-Detection Measures

A normal browser controlled by Selenium leaves obvious traces that tell a site it is automated. These options remove some of the most visible ones — for example, the AutomationControlled flag and the "Chrome is being controlled by automated software" infobar. Note this only hides the basics; serious anti-bot systems look much deeper.

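def configure_stealth_options(self):
    options = webdriver.ChromeOptions()
    # Stop Blink from exposing the AutomationControlled feature to page scripts
    options.add_argument('--disable-blink-features=AutomationControlled')
    # Older flag for hiding the infobar; recent Chrome versions may ignore it
    options.add_argument('--disable-infobars')
    # Drop the enable-automation switch, which triggers the automation infobar
    options.add_experimental_option('excludeSwitches', ['enable-automation'])
    options.add_experimental_option('useAutomationExtension', False)
    # Pass the result to webdriver.Chrome(options=...) for it to take effect
    return options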

4. Data Validation

Before saving a result, check that the fields you actually need are present. This quick guard returns True only when title, price, and description all have values.

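def validate_extracted_data(self, data):
    required_fields = ['title', 'price', 'description']
    if not data:  # scrape_product_page returns None on failure
        return False
    # Compare against None and '' so a legitimate price of 0.0 still counts as present
    return all(data.get(field) not in (None, '') for field in required_fields)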

Common Challenges & Solutions

1. Handling Popups

Cookie banners and newsletter popups often block the content you want. The pattern here waits briefly for a popup to appear, clicks its close button if found, and simply moves on (pass) if no popup shows up within the timeout.

def handle_popup(self, timeout=3):
    try:
        # Use a short wait so pages without a popup are not held up by the default timeout
        popup = WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'popup'))
        )
        close_button = popup.find_element(By.CLASS_NAME, 'close-button')
        close_button.click()
    except TimeoutException:
        pass  # No popup found

2. Managing Sessions

To reach pages behind a login, Selenium can fill in the form just like a user. This method types the username and password, clicks submit, then waits for the dashboard to confirm the login worked. Because the driver keeps the same browser session, later page loads stay logged in.

def login(self, username, password):
    self.driver.get('https://example.com/login')
    
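    username_field = self.find_element_safely(By.ID, 'username')
    password_field = self.find_element_safely(By.ID, 'password')
    submit = self.find_element_safely(By.ID, 'login-button')
    
    # find_element_safely returns None on timeout, so stop if any part of the form is missing
    if any(el is None for el in (username_field, password_field, submit)):
        return None
    
    username_field.send_keys(username)
    password_field.send_keys(password)
    submit.click()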
    
    return self.wait_for_dynamic_content('.dashboard')

Advanced Topics

1. Parallel Scraping

Scraping pages one at a time is slow. A ThreadPoolExecutor runs several at once — here up to four workers — so multiple pages are fetched in parallel and the results collected together. Keep the worker count modest so you do not hammer the target site.

from concurrent.futures import ThreadPoolExecutor

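# scrape_single_page is assumed to be defined elsewhere. It should create and quit
# its own driver, because a WebDriver instance should not be shared across threads.
def scrape_multiple_pages(urls, max_workers=4):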
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(scrape_single_page, url) for url in urls]
        for future in futures:
            results.append(future.result())
    return results
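
In the version above, one failing page makes future.result() raise and the whole batch is lost. A variant that isolates failures per URL might look like the sketch below. scrape_all is a hypothetical name, and scrape_one is whatever single-page function you supply (assumed to manage its own driver); duplicate URLs collapse into one entry because results are keyed by URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_all(urls, scrape_one, max_workers=4):
    """Run scrape_one(url) for each URL in a thread pool.

    Returns a dict mapping each URL to its result, or to None if that
    URL raised an exception.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(scrape_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going when a single page fails
                print(f'Error scraping {url}: {exc}')
                results[url] = None
    return results
```

Using as_completed means results are handled as each page finishes, so one slow page does not delay processing of the others.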

2. Custom Wait Conditions

Selenium's built-in waits cover most cases, but you can write your own. A custom condition is just a function that takes the driver and returns a truthy value when you are ready to continue. The example waits until an element's text differs from what it was before, which is handy after clicking something that updates a label in place.

from selenium.webdriver.support.wait import WebDriverWait

def wait_for_text_change(self, element, original_text):
    def text_changed(driver):
        return element.text != original_text
    
    self.wait.until(text_changed)
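
If the page replaces the element instead of updating it, the stored reference can go stale while you wait. A callable class that looks the element up again on every poll avoids that. This is a sketch: text_differs is a hypothetical name, written in the same lowercase style as Selenium's built-in conditions, and it relies only on the driver having a find_element(by, value) method.

```python
class text_differs:
    """Wait condition: truthy once the located element's text has changed."""

    def __init__(self, locator, original_text):
        self.locator = locator            # a (by, value) tuple
        self.original_text = original_text

    def __call__(self, driver):
        # Look the element up on each poll so a re-rendered node is picked up
        element = driver.find_element(*self.locator)
        if element.text != self.original_text:
            return element                # truthy, so the wait stops here
        return False
```

Usage would be along the lines of `self.wait.until(text_differs((By.ID, 'status'), 'Loading...'))`, where the 'status' ID and the original text are placeholders; until() hands back the element the condition returned.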

Remember to always respect websites' terms of service and implement proper delays between requests to avoid overwhelming servers.
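
One simple way to space out requests is a randomized pause between page loads. The sketch below rests on assumptions: polite_pause is a hypothetical helper, and its default range of 1 to 3 seconds is an illustrative choice, not a value recommended by Selenium or by any particular site.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration in [min_seconds, max_seconds].

    Returns the delay that was used, which is handy for logging.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling polite_pause() after each driver.get(url) keeps a sequential scraper from sending requests back to back.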

Frequently asked questions

Why is my Selenium script not finding elements?

Almost always a timing problem: your code looks for the element before the page has rendered it. Use explicit waits (WebDriverWait plus expected_conditions) so Selenium keeps checking until the element appears, rather than time.sleep, which guesses a fixed delay and is both brittle and slow.

Is Selenium detectable as a bot?

Yes. A default WebDriver browser exposes signals like the navigator.webdriver flag (a JavaScript property that is true only for automated browsers), so protected sites can identify it quickly. A browser configured to behave like a normal user session, or a dedicated scraping API, presents a more consistent profile.

Should I use Selenium or Playwright?

Playwright is the more modern choice: a cleaner API (available in both sync and async forms in Python), automatic waiting for elements, and better defaults out of the box. Selenium is still a solid pick for existing projects and supports the widest range of programming languages.

Last updated: 2026-05-31