Why use XPath for scraping?
HTML is a tree of nested elements, and XPath is a way to write a path to any node in that tree. You can test any XPath instantly in your browser: open DevTools, go to the Console, and run $x('//h1') to get an array of matching elements.
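XPath earns its place alongside CSS selectors because it handles two things CSS selectors do poorly: selecting an element by its text (//a[text()="Next"]) and walking back up the tree to a parent or previous sibling. CSS has no text matching at all, and its only upward reach is the newer :has() pseudo-class, whose support varies across browsers and parsing libraries. When a page has no helpful classes or ids, or you need "the price next to this label," XPath is often the only clean option.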
XPath syntax cheat sheet
| Expression | Selects |
|---|---|
| //div | All <div> elements anywhere in the document |
| /html/body/div | <div> that is a direct child of body |
| //div/p | <p> that is a direct child of any <div> |
| //div//a | <a> anywhere inside any <div> (descendant) |
| //*[@id="main"] | Any element with id="main" |
| //a[@class="btn"] | <a> with class exactly "btn" |
| //a/@href | The href attribute value of every <a> |
| //h1/text() | The text node inside each <h1> |
| (//div[@class="p"])[1] | The first matching <div> |
| //li[last()] | The last <li> in its list |
The two leading-slash forms are the ones you use constantly: // means "anywhere below," and / means "direct child." Attributes are matched in square brackets with @.
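To make the difference between / and // concrete, here is a minimal sketch using lxml. The HTML fragment is invented for illustration and is not taken from any real page.

```python
from lxml import html

# Hypothetical page, made up for this example.
doc = html.fromstring("""
<html><body>
  <div id="main">
    <h1>Catalog</h1>
    <p>Intro</p>
    <section><p>Nested</p></section>
    <a class="btn" href="/buy">Buy</a>
  </div>
</body></html>
""")

# "/" steps to direct children only; "//" reaches descendants at any depth.
direct = doc.xpath('//div/p/text()')      # ['Intro']
anywhere = doc.xpath('//div//p/text()')   # ['Intro', 'Nested']

# Attribute predicate, attribute value, and text node selection.
main_id = doc.xpath('//*[@id="main"]')[0].tag   # 'div'
hrefs = doc.xpath('//a[@class="btn"]/@href')    # ['/buy']
heading = doc.xpath('//h1/text()')              # ['Catalog']

print(direct, anywhere, main_id, hrefs, heading)
```

The nested paragraph only shows up in the // query, because it is a grandchild of the div rather than a direct child.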
Predicates, functions, and axes
Predicates (square brackets) filter matches, and XPath ships handy functions for partial and text matching:
| Expression | Selects |
|---|---|
| //div[contains(@class,"product")] | class contains "product" (partial match) |
| //a[starts-with(@href,"/p/")] | href begins with "/p/" |
| //button[text()="Add to cart"] | button whose text is exactly that |
| //span[contains(text(),"in stock")] | span whose text contains the phrase |
| //input[@type="email" and @required] | multiple conditions with and / or |
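The predicates in the table can be tried against a small fragment. This is a sketch with made-up markup. It also shows one thing the table does not: text()="..." is an exact comparison, so stray whitespace breaks it, and normalize-space() is a common workaround.

```python
from lxml import html

# Hypothetical fragment, invented for this example.
doc = html.fromstring("""
<div>
  <div class="product featured"><a href="/p/1">Widget</a></div>
  <div class="product"><a href="/about">About</a></div>
  <span class="stock"> 3 in stock </span>
  <button>  Add to cart  </button>
  <input type="email" required>
  <input type="email">
</div>
""")

# Exact attribute match vs. partial match on the class attribute.
exact_class = doc.xpath('//div[@class="product"]')               # 1 element
partial_class = doc.xpath('//div[contains(@class,"product")]')   # 2 elements

product_links = doc.xpath('//a[starts-with(@href,"/p/")]/@href')  # ['/p/1']
in_stock = doc.xpath('//span[contains(text(),"in stock")]')       # 1 element

# Exact text comparison fails on padded text; normalize-space() trims it first.
strict_button = doc.xpath('//button[text()="Add to cart"]')                 # []
tolerant_button = doc.xpath('//button[normalize-space()="Add to cart"]')    # 1 element

# Two conditions joined with "and": only the required email input matches.
required_email = doc.xpath('//input[@type="email" and @required]')  # 1 element
```

Keep in mind that contains(@class,"product") is a plain substring test, so it would also match a class such as "byproduct" or "product-list".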
Axes are where XPath pulls ahead: they let you move in any direction from a node, including up to ancestors and back to earlier siblings, which CSS selectors largely cannot do:
| Axis | Example | Meaning |
|---|---|---|
| parent | //span[@class="price"]/parent::div | The div wrapping the price span |
| ancestor | //a/ancestor::article | The article a link sits inside |
| following-sibling | //dt[text()="Price"]/following-sibling::dd[1] | The value next to a label (without [1], every later <dd> sibling matches) |
| preceding-sibling | //dd/preceding-sibling::dt[1] | The label before a value |
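The following-sibling pattern ("find the label, then take the value beside it") is one of the most useful real-world scraping tricks. CSS can step to the next sibling with +, but it cannot find the label by its text in the first place, so the pattern is out of reach for CSS selectors alone.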
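A minimal sketch of the label-then-value pattern with lxml follows. The definition list and the value_for helper are hypothetical, written only for this example. The helper passes the label as an XPath variable, which lxml accepts as a keyword argument to xpath().

```python
from lxml import html

# Hypothetical definition list, invented for this example.
doc = html.fromstring("""
<dl>
  <dt>Price</dt><dd>$19.99</dd>
  <dt>Weight</dt><dd>2 kg</dd>
</dl>
""")

# Without a position, following-sibling returns every later <dd>.
all_later = doc.xpath('//dt[text()="Price"]/following-sibling::dd/text()')
# [1] keeps only the nearest one, which is the value paired with the label.
nearest = doc.xpath('//dt[text()="Price"]/following-sibling::dd[1]/text()')
# preceding-sibling counts backwards, so [1] is the closest earlier <dt>.
label = doc.xpath('//dd[text()="2 kg"]/preceding-sibling::dt[1]/text()')


def value_for(label_text):
    """Return the value beside a label, or None if the label is absent."""
    hits = doc.xpath(
        '//dt[normalize-space()=$label]/following-sibling::dd[1]/text()',
        label=label_text,  # bound to $label in the expression
    )
    return hits[0].strip() if hits else None


print(all_later)            # ['$19.99', '2 kg']
print(nearest)              # ['$19.99']
print(label)                # ['Weight']
print(value_for("Weight"))  # 2 kg
print(value_for("Color"))   # None
```

Using a variable instead of string formatting avoids quoting problems when a label contains quotes.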
XPath in Python with lxml
The standard way to use XPath in Python is the lxml library (also the engine behind Scrapy and parsel). Install with pip install lxml requests.
```python
import requests
from lxml import html

resp = requests.get("https://books.toscrape.com/", timeout=10)
resp.raise_for_status()  # stop early on an HTTP error status
tree = html.fromstring(resp.content)

# Select titles and prices with XPath:
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')
prices = tree.xpath('//p[@class="price_color"]/text()')
for title, price in zip(titles, prices):
    print(title, "|", price)

# Relative XPath: the rating is encoded in the class of a <p> inside the article.
first = tree.xpath('(//article[@class="product_pod"])[1]')[0]
rating = first.xpath('.//p[contains(@class,"star-rating")]/@class')
print(rating)  # e.g. ['star-rating Three']
```

For node-selecting expressions like these, tree.xpath() returns a list: of elements, strings (for text()), or attribute values (for @attr). Note the leading . in .// when running XPath relative to an element you already selected.
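Pairing two separate lists with zip works only while every product has both fields. If one is missing, the lists silently fall out of step. A sketch of the alternative is to select each product element first and run relative queries inside it. The markup below is a made-up fragment that imitates the class names used above, with one price deliberately left out.

```python
from lxml import html

# Hypothetical fragment; the second product has no price on purpose.
doc = html.fromstring("""
<section>
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a title="Book A" href="a.html">Book A</a></h3>
    <p class="price_color">$10.00</p>
  </article>
  <article class="product_pod">
    <p class="star-rating One"></p>
    <h3><a title="Book B" href="b.html">Book B</a></h3>
  </article>
</section>
""")

books = []
for card in doc.xpath('//article[@class="product_pod"]'):
    # The leading "." keeps each query inside the current card.
    title = card.xpath('.//h3/a/@title')
    price = card.xpath('.//p[@class="price_color"]/text()')
    rating = card.xpath('.//p[contains(@class,"star-rating")]/@class')
    books.append({
        "title": title[0] if title else None,
        "price": price[0] if price else None,
        # The class looks like "star-rating Three"; keep the last word.
        "rating": rating[0].split()[-1] if rating else None,
    })

# Without the dot, "//" starts from the document root even when called
# on the second card, so it picks up the first card's price.
second_card = doc.xpath('//article[@class="product_pod"]')[1]
leaked = second_card.xpath('//p[@class="price_color"]/text()')

print(books)
print(leaked)  # ['$10.00']
```

The second book ends up with price None instead of borrowing a neighbor's price, which is usually the behavior you want in a scraper.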
XPath in Node.js with Playwright
Many XPath tutorials stick to Python, but XPath works just as well in JavaScript. Playwright supports XPath selectors directly with the xpath= prefix, and because it drives a real browser, the page's JavaScript runs before XPath queries the DOM:
```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');

  // Playwright accepts XPath with the xpath= prefix:
  const titles = await page
    .locator('xpath=//article[@class="product_pod"]//h3/a')
    .evaluateAll((els) => els.map((e) => e.getAttribute('title')));
  console.log(titles);

  await browser.close();
})();
```

For static HTML in Node without a browser, the xpath + @xmldom/xmldom packages evaluate XPath against a parsed document. Because xmldom is an XML parser, messy real-world HTML may need cleanup before it parses reliably, and Playwright is the cleaner route when the page needs JavaScript anyway.
XPath vs CSS selectors — and why scrapers fail on protected sites
| Need | XPath | CSS selector |
|---|---|---|
| Select by tag/class/id | Yes | Yes (shorter) |
| Select by visible text | Yes | No |
| Navigate to parent/ancestor | Yes | Only via :has(), where supported |
| Select previous sibling | Yes | Only via :has(), where supported |
| Readability for simple cases | Verbose | Cleaner |
Use CSS selectors for everyday selecting and XPath when you need text matching or tree navigation. Many scrapers mix both.
One thing neither solves: a selector only works if you actually received the real HTML. If the page is rendered by JavaScript or the site blocks your request, even a perfect XPath matches nothing. A headless browser such as Playwright, or a scraping API that returns the fully rendered HTML, gives your XPath real markup to query.
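Because an empty result looks the same whether the XPath is wrong or the page is a block screen, it can help to fail loudly instead of returning an empty list. The sketch below assumes a hypothetical helper, select_or_raise, and two invented page strings. It is one possible guard, not a standard lxml feature.

```python
from lxml import html


class NoMatchError(RuntimeError):
    """Raised when an XPath finds nothing in the page we received."""


def select_or_raise(page_source, expression):
    """Run a node-selecting XPath and raise if it matches nothing."""
    tree = html.fromstring(page_source)
    hits = tree.xpath(expression)
    if not hits:
        # Include the page title so block pages are easy to spot in logs.
        title = tree.xpath('string(//title)').strip()
        raise NoMatchError(
            f"{expression!r} matched nothing (page title: {title!r})"
        )
    return hits


# Hypothetical responses: the markup we hoped for, and a block page.
real_page = (
    '<html><head><title>Shop</title></head>'
    '<body><p class="price_color">$10.00</p></body></html>'
)
block_page = (
    '<html><head><title>Access denied</title></head>'
    '<body><h1>403</h1></body></html>'
)

prices = select_or_raise(real_page, '//p[@class="price_color"]/text()')
print(prices)  # ['$10.00']

try:
    select_or_raise(block_page, '//p[@class="price_color"]/text()')
except NoMatchError as err:
    message = str(err)
    print(message)
```

Logging the page title (or the first few hundred characters of the body) usually makes it obvious whether you are looking at a selector bug or a response that never contained the data.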