XPath for Web Scraping

By the Scrappey Research Team

XPath (XML Path Language) is a query language for selecting nodes in an HTML or XML document, widely used in web scraping to pinpoint the exact elements you want to extract. Where CSS selectors target elements by tag, class, and id, XPath can also select by an element's text content and navigate in any direction — to parents, siblings, and ancestors. It is supported by Python's lxml and Scrapy, browser DevTools, Selenium, and Playwright.

What it is	A path language for selecting nodes in HTML/XML
Select by text	//button[text()="Buy"] — CSS cannot do this
Navigate up	Axes: parent::, ancestor::, following-sibling::
Python	lxml: tree.xpath("//...") ; also Scrapy/parsel
Browser/JS	Selenium, Playwright, DevTools $x("//...")

Why use XPath for scraping?

HTML is a tree of nested elements, and XPath is a way to write a path to any node in that tree. You can test any XPath instantly in your browser: open DevTools, go to the Console, and run $x('//h1') to get an array of matching elements.

XPath earns its place alongside CSS selectors because it can do two things CSS cannot: select an element by its visible text (//a[text()="Next"]) and walk back up the tree to a parent or previous sibling. When a page has no helpful classes or ids, or you need "the price next to this label," XPath is often the only clean option.
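
As a rough illustration of those two capabilities, here is a minimal sketch using lxml on a small pager fragment. The markup is invented for this example and is not taken from any real site.

```python
from lxml import html

# Hypothetical markup, made up for this illustration.
doc = html.fromstring("""
<div class="pager">
  <ul>
    <li><a href="/page/1">Previous</a></li>
    <li><a href="/page/3">Next</a></li>
  </ul>
</div>
""")

# 1) Select a link by its exact text, then read its href attribute.
next_href = doc.xpath('//a[text()="Next"]/@href')

# 2) Walk back up the tree: from the matched link to its enclosing <li>,
#    then across to the nearest <li> before it.
prev_text = doc.xpath(
    '//a[text()="Next"]/parent::li/preceding-sibling::li[1]/a/text()'
)

print(next_href)  # ['/page/3']
print(prev_text)  # ['Previous']
```

Both queries start from the text of the link rather than from a class or id, which is the situation where XPath tends to be the more direct tool.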

XPath syntax cheat sheet

Expression	Selects
`//div`	All <div> elements anywhere in the document
`/html/body/div`	Every <div> that is a direct child of <body> (absolute path from the root)
`//div/p`	<p> that is a direct child of any <div>
`//div//a`	<a> anywhere inside any <div> (descendant)
`//*[@id="main"]`	Any element with id="main"
`//a[@class="btn"]`	<a> whose class attribute is exactly "btn" (class="btn primary" does not match)
`//a/@href`	The href attribute value of every <a>
`//h1/text()`	The text nodes directly inside each <h1>
`(//div[@class="p"])[1]`	The first matching <div> in the whole document
`//li[last()]`	The last <li> child of each parent

The two leading-slash forms are the ones you use constantly: // means "anywhere below," and / means "direct child." Attributes are matched in square brackets with @.
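
To see several of the cheat-sheet expressions side by side, the sketch below runs them against a tiny page with lxml. The page content is hypothetical and exists only to make each result easy to predict.

```python
from lxml import html

# Hypothetical page, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <div id="main">
    <h1>Catalog</h1>
    <p>Intro</p>
    <ul>
      <li><a class="btn" href="/a">A</a></li>
      <li><a href="/b">B</a></li>
    </ul>
  </div>
  <div class="p"><p>Footer</p></div>
</body></html>
""")

div_count = len(doc.xpath('//div'))               # every <div> in the document
top_ids = doc.xpath('/html/body/div/@id')         # ids of <div>s directly under <body>
child_ps = doc.xpath('//div/p/text()')            # <p> as a direct child of a <div>
links = doc.xpath('//div//a/@href')               # <a> at any depth inside a <div>
btn = doc.xpath('//a[@class="btn"]/@href')        # class attribute exactly "btn"
heading = doc.xpath('//h1/text()')                # text node inside <h1>
last_item = doc.xpath('//li[last()]/a/text()')    # last <li> of its parent
first_p_div = doc.xpath('(//div[@class="p"])[1]/p/text()')

print(div_count, top_ids, child_ps, links)
# 2 ['main'] ['Intro', 'Footer'] ['/a', '/b']
print(btn, heading, last_item, first_p_div)
# ['/a'] ['Catalog'] ['B'] ['Footer']
```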

Predicates, functions, and axes

Predicates (square brackets) filter matches, and XPath ships handy functions for partial and text matching:

Expression	Selects
`//div[contains(@class,"product")]`	class contains "product" (partial match)
`//a[starts-with(@href,"/p/")]`	href begins with "/p/"
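`//button[text()="Add to cart"]`	button with a text node that is exactly that string (whitespace included)
`//span[contains(text(),"in stock")]`	span whose first text node contains the phrase; use contains(., "in stock") to search all nested text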
`//input[@type="email" and @required]`	multiple conditions with and / or
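
The predicates in the table can be tried on a small fragment like the one below. This is only a sketch: the markup is invented, and the concat/normalize-space expression is a common XPath 1.0 idiom for matching a single class token that is added here as an extra, not something from the table.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring("""
<html><body>
  <div class="product featured"><a href="/p/1">Widget</a> <span>3 in stock</span></div>
  <div class="product-list"><a href="/about">About</a> <span>sold out</span></div>
  <form><input type="email" name="work" required><input type="email" name="other"></form>
</body></html>
""")

# Exact match compares the whole attribute string, so neither <div> matches.
exact = doc.xpath('//div[@class="product"]')

# contains() is a substring test, so "product-list" matches as well.
partial = doc.xpath('//div[contains(@class,"product")]')

# Idiom for matching one whole class token (an addition to the table above).
token = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), " product ")]'
)

product_links = doc.xpath('//a[starts-with(@href,"/p/")]/text()')
in_stock = doc.xpath('//span[contains(text(),"in stock")]/text()')
required_email = doc.xpath('//input[@type="email" and @required]/@name')

print(len(exact), len(partial), len(token))     # 0 2 1
print(product_links, in_stock, required_email)  # ['Widget'] ['3 in stock'] ['work']
```

The difference between the three class checks is the one most likely to matter in practice: exact matching is too strict for multi-class elements, and plain contains() can be too loose.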

Axes are XPath's superpower — they let you move in any direction, which CSS cannot:

Axis	Example	Meaning
parent	`//span[@class="price"]/parent::div`	The div wrapping the price span
ancestor	`//a/ancestor::article`	The article a link sits inside
following-sibling	`//dt[text()="Price"]/following-sibling::dd`	The value next to a label
preceding-sibling	`//dd/preceding-sibling::dt[1]`	The label before a value

The following-sibling pattern ("find the label, then take the value beside it") is one of the most useful real-world scraping tricks. Standard CSS selectors alone cannot express it, because they have no way to match the label by its text.
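
The label-then-value pattern could be wrapped in a small helper along these lines. The helper name value_for and the sample markup are hypothetical; the $label placeholder relies on lxml's support for XPath variables passed as keyword arguments.

```python
from lxml import html

# Hypothetical definition list, invented for this illustration.
doc = html.fromstring("""
<dl>
  <dt>Price</dt><dd>19.99</dd>
  <dt>Availability</dt><dd>In stock</dd>
</dl>
""")

def value_for(tree, label):
    """Return the text of the first <dd> after the <dt> whose text equals label."""
    # $label is an XPath variable; lxml fills it from the keyword argument,
    # which avoids quoting problems when the label itself contains quotes.
    found = tree.xpath(
        '//dt[text()=$label]/following-sibling::dd[1]/text()', label=label
    )
    return found[0] if found else None

price = value_for(doc, "Price")
missing = value_for(doc, "Weight")

# Without [1], following-sibling returns every later <dd>, not just the nearest.
all_following = doc.xpath('//dt[text()="Price"]/following-sibling::dd/text()')

# The reverse direction: the nearest label before a given value.
label_before = doc.xpath('//dd[text()="In stock"]/preceding-sibling::dt[1]/text()')

print(price, missing)   # 19.99 None
print(all_following)    # ['19.99', 'In stock']
print(label_before)     # ['Availability']
```

The [1] after the axis is what restricts the result to the nearest sibling; leaving it out is a common source of "wrong value" bugs.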

XPath in Python with lxml

The standard way to use XPath in Python is the lxml library (also the engine behind Scrapy and parsel). Install with pip install lxml requests.

import requests
from lxml import html

resp = requests.get("https://books.toscrape.com/", timeout=30)
resp.raise_for_status()  # fail fast on HTTP errors instead of parsing an error page
tree = html.fromstring(resp.content)

# Select titles and prices with XPath:
titles = tree.xpath('//article[@class="product_pod"]//h3/a/@title')
prices = tree.xpath('//p[@class="price_color"]/text()')

for title, price in zip(titles, prices):
    print(title, "|", price)

# Relative query: the rating is encoded in the class attribute of a <p> inside the article.
first = tree.xpath('(//article[@class="product_pod"])[1]')[0]
rating = first.xpath('.//p[contains(@class,"star-rating")]/@class')
print(rating)   # e.g. ['star-rating Three']

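For expressions that select nodes, tree.xpath() returns a list: elements, strings (for text()), or attribute values (for @attr). Expressions that compute a value, such as count(//li) or string(//h1), return a float, string, or boolean instead. Note the leading . in .// when running XPath relative to an element you already selected; without it, // searches the whole document again.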
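
A short sketch of what xpath() hands back for different kinds of expression, and of what the leading dot changes. The list markup is invented for this example.

```python
from lxml import html

# Hypothetical markup, invented for this illustration.
doc = html.fromstring(
    '<ul><li class="x"><a href="/1">One</a></li><li><a href="/2">Two</a></li></ul>'
)

items = doc.xpath('//li')                       # list of element objects
texts = doc.xpath('//li/a/text()')              # list of strings
hrefs = doc.xpath('//li/a/@href')               # list of attribute values
count = doc.xpath('count(//li)')                # a float, not a list
has_x = doc.xpath('boolean(//li[@class="x"])')  # a bool
first_text = doc.xpath('string(//li[1]/a)')     # a single string

first = items[0]
relative = first.xpath('.//a/@href')  # searches only inside the first <li>
absolute = first.xpath('//a/@href')   # no dot: searches the whole document

print(len(items), texts, hrefs)    # 2 ['One', 'Two'] ['/1', '/2']
print(count, has_x, first_text)    # 2.0 True One
print(relative, absolute)          # ['/1'] ['/1', '/2']
```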

XPath in Node.js with Playwright

Many XPath tutorials focus on Python, but XPath works just as well in JavaScript. Playwright supports XPath selectors directly with the xpath= prefix, and because it drives a real browser, your XPath is evaluated against the DOM after the page's JavaScript has run:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');

  // Playwright accepts XPath with the xpath= prefix:
  const titles = await page
    .locator('xpath=//article[@class="product_pod"]//h3/a')
    .evaluateAll((els) => els.map((e) => e.getAttribute('title')));

  console.log(titles);
  await browser.close();
})();

For static HTML in Node without a browser, the xpath + @xmldom/xmldom packages evaluate XPath against a parsed document. But Playwright is the cleaner route when the page needs JavaScript anyway.

XPath vs CSS selectors — and why scrapers fail on protected sites

Need	XPath	CSS selector
Select by tag/class/id	Yes	Yes (shorter)
Select by visible text	Yes	No
Navigate to parent/ancestor	Yes	No
Select previous sibling	Yes	No
Readability for simple cases	Verbose	Cleaner

Use CSS selectors for everyday selecting and XPath when you need text matching or tree navigation. Many scrapers mix both.

One thing neither solves: a selector only works if you actually received the real HTML. If the page is JavaScript-rendered or the site blocks you, your perfect XPath matches nothing. A scraping API that renders the page and returns the resulting HTML gives your XPath real markup to query.

Code example

import requests
from lxml import html

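# Fetch rendered HTML through a scraping API, then query it with XPath.
# The endpoint is a placeholder; request and response fields depend on your provider.
resp = requests.post(
    'https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY',
    json={'cmd': 'request.get', 'url': 'https://example.com/products'},
    timeout=120,
)
resp.raise_for_status()  # stop on an HTTP error instead of parsing an error body
page = resp.json()['solution']['response']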
tree = html.fromstring(page)

for row in tree.xpath('//article[@class="product_pod"]'):
    title = row.xpath('.//h3/a/@title')[0]
    price = row.xpath('.//p[@class="price_color"]/text()')[0]
    print(title, '|', price)
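
Indexing an XPath result with [0] raises IndexError when a row is missing the element. One possible way to guard against that is a small helper like the one below. The name first and the sample markup are hypothetical, and the helper is meant only for expressions that return a list.

```python
from lxml import html

def first(node, expr, default=None):
    """Return the first XPath match (stripped if it is a string), else default."""
    matches = node.xpath(expr)
    if not matches:
        return default
    value = matches[0]
    # text() and @attr results are str subclasses; elements are returned as-is.
    return value.strip() if isinstance(value, str) else value

# Hypothetical markup: the second product has no price element.
doc = html.fromstring("""
<html><body>
  <article class="product_pod"><h3><a title="Book A">Book A</a></h3><p class="price_color"> 10.00 </p></article>
  <article class="product_pod"><h3><a title="Book B">Book B</a></h3></article>
</body></html>
""")

rows = [
    (first(row, './/h3/a/@title'), first(row, './/p[@class="price_color"]/text()'))
    for row in doc.xpath('//article[@class="product_pod"]')
]
print(rows)  # [('Book A', '10.00'), ('Book B', None)]
```

Querying each row relative to its own element also keeps titles and prices paired, even when one of them is absent.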

Frequently asked questions

What is XPath in web scraping?

XPath (XML Path Language) is a query language for selecting nodes in an HTML or XML document. In web scraping you use it to pinpoint the elements you want to extract — by tag, attribute, position, or text content. Unlike CSS selectors, XPath can also select elements by their visible text and navigate to parent, ancestor, and sibling nodes.

XPath vs CSS selectors — which is better for scraping?

Use CSS selectors for everyday selection by tag, class, and id; they are shorter and very readable. Use XPath when you need something CSS cannot do: selecting by visible text (//button[text()="Buy"]), walking up to a parent or ancestor, or grabbing the value next to a label with following-sibling. Most experienced scrapers use both, picking whichever is cleaner for each case.

How do I test an XPath expression?

In your browser, open DevTools, switch to the Console tab, and run $x("//your/xpath") — it returns an array of matching elements you can inspect immediately. In the Elements panel you can also right-click an element and choose Copy > Copy XPath, though hand-written XPaths are usually more robust than the auto-generated ones.
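
On the Python side, a quick syntax check is possible by compiling the expression with lxml's etree.XPath class before using it. A minimal sketch follows; the helper name is_valid_xpath is made up for this example, and it only checks syntax, not whether the expression matches anything on a page.

```python
from lxml import etree

def is_valid_xpath(expr):
    """Return True if lxml can compile the expression, False on an XPath error."""
    try:
        etree.XPath(expr)  # compiles the expression without evaluating it
        return True
    except etree.XPathError:
        return False

ok = is_valid_xpath('//a[@href]')
bad = is_valid_xpath('//a[@href')  # missing closing bracket

print(ok, bad)  # True False
```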

Can I use XPath in JavaScript and Node.js?

Yes. Most tutorials are Python-focused, but Playwright supports XPath selectors directly with the xpath= prefix, and Selenium does too. For static HTML in Node without a browser, the xpath package combined with @xmldom/xmldom evaluates XPath against a parsed document. In the browser itself, document.evaluate and the DevTools $x() helper run XPath natively.

Last updated: 2026-06-08