BeautifulSoup vs lxml: HTML Parsing

By the Scrappey Research Team

BeautifulSoup and lxml are both Python libraries for parsing HTML, but they sit at different layers: lxml is a fast C-backed parser with XPath support, while BeautifulSoup is a friendlier navigation layer that can use lxml as its parsing backend. In practice they are not strict rivals. On the same page, lxml's libxml2 engine parses much faster, yet a common setup keeps BeautifulSoup's readable find/find_all API and lets lxml do the parsing underneath. This guide compares speed, query syntax, memory, and malformed-HTML handling so you can pick the right combination.

Quick facts

lxml is: C-backed parser (libxml2/libxslt)
BeautifulSoup is: Pure-Python navigation API over a parser
Speed: lxml much faster; BS4+lxml in between
Query syntax: lxml: XPath + CSS; BS4: find/find_all + CSS select
Use together?: Yes - BeautifulSoup(markup, 'lxml')

Speed and memory: why lxml wins on raw parsing

lxml is faster because it is not really a Python parser - it is a thin binding over libxml2 and libxslt, two mature C libraries, with the glue code compiled via Cython. BeautifulSoup, by contrast, is pure Python: it builds a tree of Tag and NavigableString objects and walks them in the interpreter. On large documents or tight loops that gap is real - independent benchmarks consistently show lxml parsing the same HTML several times faster than BeautifulSoup running on its slower backends.

Part of the difference is which parser actually does the work. BeautifulSoup does not parse HTML itself; it delegates to a backend you choose in the constructor:

  • 'lxml' - fastest, C-backed, needs the lxml package installed.
  • 'html.parser' - Python's built-in standard-library parser; no extra dependency, noticeably slower.
  • 'html5lib' - pure-Python, follows the HTML5 spec most faithfully and fixes the most broken markup, but is by far the slowest.
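
As a minimal sketch of the backend choice above, the snippet below parses the same markup with each backend name and skips any that is not installed. The SAMPLE markup and the parse_with helper are illustrative names, not part of either library.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative, well-formed markup so every backend builds the same tree.
SAMPLE = "<ul><li>one</li><li>two</li></ul>"


def parse_with(backend, markup=SAMPLE):
    """Return a BeautifulSoup tree, or None if the backend is not installed."""
    try:
        return BeautifulSoup(markup, backend)
    except FeatureNotFound:
        # Raised when the requested parser (e.g. lxml, html5lib) is missing.
        return None


results = {}
for backend in ("lxml", "html.parser", "html5lib"):
    soup = parse_with(backend)
    if soup is None:
        results[backend] = None
    else:
        results[backend] = [li.get_text(strip=True) for li in soup.find_all("li")]

for backend, items in results.items():
    print(backend, "->", items if items is not None else "not installed")
```

Only 'html.parser' is guaranteed to be available, since it ships with Python; the other two depend on what is installed in your environment.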

Memory follows the same pattern. lxml stores nodes in a compact C structure, while BeautifulSoup's wrapper objects carry more per-element Python overhead. For one-off scraping of a single page none of this matters; for parsing thousands of documents in a worker, the C-backed path keeps both time and RAM down.
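
If you want to check the speed gap on your own machine rather than take it on trust, a rough timing sketch like the one below is enough. The synthetic DOC and the best_of helper are assumptions made for illustration; the numbers will vary with document shape, library versions, and hardware, so treat the output as indicative only.

```python
import timeit

from bs4 import BeautifulSoup
from lxml import html as lxml_html

# Synthetic document: a table with 2000 rows (illustrative, not a real page).
rows = "".join(f"<tr><td>{i}</td><td>item {i}</td></tr>" for i in range(2000))
DOC = f"<html><body><table>{rows}</table></body></html>"


def best_of(fn, repeat=5):
    """Return the fastest of `repeat` single runs, in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


timings = {
    "lxml.html": best_of(lambda: lxml_html.fromstring(DOC)),
    "bs4 + lxml": best_of(lambda: BeautifulSoup(DOC, "lxml")),
    "bs4 + html.parser": best_of(lambda: BeautifulSoup(DOC, "html.parser")),
}

# Print fastest first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} {seconds * 1000:8.2f} ms")
```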

Query syntax: XPath in lxml vs find_all and CSS in BeautifulSoup

The biggest day-to-day difference is how you locate elements. lxml's lxml.html module exposes an ElementTree-style API and supports full XPath 1.0 through the .xpath() method, which is well suited to deep, conditional, or relative queries (selecting a node by a sibling's text, walking up to an ancestor, indexing into matches). It also supports CSS selectors via .cssselect(), which relies on the separately installed cssselect package to translate each selector into XPath under the hood.
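
The three kinds of XPath query mentioned above can be sketched as follows. The SNIPPET markup and its class names are invented for illustration.

```python
from lxml import html

# Illustrative product fragment; not taken from a real site.
SNIPPET = """
<div class="product">
  <h2>Widget</h2>
  <dl>
    <dt>Price</dt><dd>19.99</dd>
    <dt>Stock</dt><dd>12</dd>
  </dl>
</div>
"""

tree = html.fromstring(SNIPPET)

# 1) Select a node by a sibling's text: the <dd> right after <dt>Price</dt>.
price = tree.xpath(
    "//dt[normalize-space()='Price']/following-sibling::dd[1]/text()"
)[0]

# 2) Walk up to an ancestor: from the stock value to the product's heading.
name = tree.xpath(
    "//dd[normalize-space()='12']/ancestor::div[@class='product']/h2/text()"
)[0]

# 3) Index into matches: the second <dd> in document order.
second_dd = tree.xpath("(//dd)[2]/text()")[0]

print(price, name, second_dd)
```

Note that .xpath() returns a list for node and text selections, so each query indexes [0]; real code should check for an empty list first.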

BeautifulSoup deliberately keeps things plainer. You navigate with readable methods - find(), find_all(), attribute filters like class_=, and tree-walking properties such as .parent, .next_sibling, and .children. For selector fans it also offers .select() and .select_one(), which accept CSS selectors. What BeautifulSoup does not offer is native XPath; if you need XPath expressions, that is squarely lxml's territory (and the reason Scrapy's selectors are built on lxml). The trade is clarity for power: BeautifulSoup code tends to read like prose, while an lxml XPath one-liner can express a query that would take several chained BeautifulSoup calls.
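
For comparison, here is a sketch of BeautifulSoup's navigation style against a small invented fragment. It uses find_next_sibling() and find_parent() rather than the bare .next_sibling and .parent properties, because whitespace between tags in real pages often makes .next_sibling return a text node instead of the next element.

```python
from bs4 import BeautifulSoup

# Illustrative product fragment; not taken from a real site.
SNIPPET = (
    '<div class="product"><h2>Widget</h2>'
    '<dl><dt>Price</dt><dd class="value">19.99</dd>'
    '<dt>Stock</dt><dd class="value">12</dd></dl></div>'
)

soup = BeautifulSoup(SNIPPET, "html.parser")

# find_all with an attribute filter: every <dd class="value">.
values = [dd.get_text() for dd in soup.find_all("dd", class_="value")]

# find by text, then step sideways and upwards through the tree.
price_label = soup.find("dt", string="Price")
price = price_label.find_next_sibling("dd").get_text()
product_name = price_label.find_parent("div", class_="product").h2.get_text()

# The same kind of lookup expressed as a CSS selector.
stock = soup.select_one("dl dd:nth-of-type(2)").get_text()

print(values, price, product_name, stock)
```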

When each fits - and using them together

Reach for lxml when speed matters, when you are processing high volumes, when the document is also valid XML, or when you want XPath. Reach for BeautifulSoup when readability and a gentle learning curve matter, for quick prototypes, and when the HTML is badly malformed - its html5lib backend and UnicodeDammit encoding detection are forgiving in ways a strict parser is not.

The two also compose. The most common pattern is to keep BeautifulSoup's friendly API but pass 'lxml' as the parser, getting C-level speed with an approachable interface. Going the other direction, lxml ships lxml.html.soupparser so you can feed BeautifulSoup-parsed trees into lxml, and lxml's own docs suggest a pragmatic fallback - parse with lxml first, and re-parse with BeautifulSoup only when encoding or breakage trips it up. Both libraries assume you already have the HTML in hand; fetching it reliably at scale (rotating proxies, headless browser rendering, retries, anti-bot challenges) is a separate problem. A managed web-data API such as Scrappey can return the rendered HTML in one call, which you then hand to lxml or BeautifulSoup exactly as below.
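
The lxml-first fallback can be sketched roughly as below. The parse_html helper is hypothetical, and the set of exceptions that should trigger the fallback is an assumption; in practice you may prefer to trigger it on your own signals, such as an empty result or garbled text.

```python
from lxml import etree, html
from lxml.html import soupparser  # requires BeautifulSoup to be installed


def parse_html(markup):
    """Parse with lxml first; fall back to BeautifulSoup via soupparser.

    Hypothetical helper: the exception list is an assumption, not a
    recommendation from either library.
    """
    try:
        return html.fromstring(markup)
    except (etree.ParserError, etree.XMLSyntaxError, ValueError):
        # BeautifulSoup builds the tree, and soupparser converts it into
        # lxml elements, so the caller still gets .xpath() and friends.
        return soupparser.fromstring(markup)


root = parse_html("<html><body><p>hi</p></body></html>")
paragraphs = root.xpath("//p/text()")
print(paragraphs)
```

Either branch returns an lxml element, so downstream code does not need to know which parser produced the tree.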

Code example

python
# Same page, two parsers - and the hybrid that gets you both.
import requests
from bs4 import BeautifulSoup
from lxml import html

# A timeout keeps the script from hanging on an unresponsive server.
resp = requests.get("https://example.com/product", timeout=10)
resp.raise_for_status()
html_text = resp.text

# 1) BeautifulSoup: readable API. Passing "lxml" as the parser = C-backed speed.
soup = BeautifulSoup(html_text, "lxml")
title_bs = soup.find("h1").get_text(strip=True)
price_bs = soup.select_one("span.price").get_text(strip=True)  # CSS selector
print("bs4 ->", title_bs, price_bs)

# 2) lxml: XPath, returns a list. Great for deep/conditional queries.
tree = html.fromstring(html_text)
title_lx = tree.xpath("//h1/text()")[0].strip()
# 'normalize-space' trims whitespace inside the matched node:
price_lx = tree.xpath("normalize-space(//span[@class='price'])")
print("lxml ->", title_lx, price_lx)

# lxml also speaks CSS via cssselect (translated to XPath internally).
# This needs the separate 'cssselect' package installed.
for row in tree.cssselect("table.specs tr"):
    cells = [c.text_content().strip() for c in row.cssselect("td")]
    print(cells)

Frequently asked questions

Is lxml always faster than BeautifulSoup?

For raw parsing, lxml is consistently faster because it runs on the compiled C libraries libxml2 and libxslt, while BeautifulSoup is pure Python. The catch is that BeautifulSoup can use lxml as its backend (BeautifulSoup(html, 'lxml')), which narrows the gap while keeping BeautifulSoup's friendlier API; the tree is still built from Python objects, so it does not match lxml used directly.

Can BeautifulSoup use lxml, and can lxml use BeautifulSoup?

Yes to both. You pass 'lxml' as the parser argument to BeautifulSoup to make it parse with lxml, and lxml provides lxml.html.soupparser to convert or build trees with BeautifulSoup when you need its lenient, encoding-tolerant parsing for very broken HTML.

Does BeautifulSoup support XPath?

No, BeautifulSoup has no native XPath. It offers find/find_all, attribute filters, tree navigation, and CSS selectors through .select(). If you need XPath expressions, use lxml's .xpath() method (which is also what Scrapy's selectors are built on).

Which should I choose for a new project?

For learning, quick scripts, or messy HTML, start with BeautifulSoup for its readability and forgiving parsing. For high-volume parsing, valid XML, or complex XPath queries, choose lxml. A common middle ground is BeautifulSoup with the lxml parser, which gives you speed and a clean API at once.

Last updated: 2026-06-16 · Facts last verified: 2026-06-16