What Is Link Extraction in Web Crawling?

Link extraction is the step in a crawl where you pull every URL out of a fetched page so you can decide which to follow next. The output is a deduplicated, normalized set of absolute URLs. The implementation looks simple, but the edge cases (relative URLs, fragment-only anchors, JS-only links, data-href attributes, links inside event handlers) make it the source of most "missing pages" bugs in crawlers.

Quick facts

Source: <a href> primarily; also <link>, <area>, <iframe src>
Steps: Parse → resolve relative URLs → strip fragments → normalize → dedupe
Easy to miss: JS-rendered links, data-* attributes, onclick handlers
Normalize: Lowercase host, sort query params, strip trailing slashes consistently
Output: Set of absolute, normalized URLs for the frontier

The basic algorithm

1. Parse the HTML.
2. Select every <a> with an href attribute.
3. Drop empty hrefs and javascript:, mailto:, and tel: links.
4. Resolve each remaining href against the page's base URL (honor a <base> tag if present).
5. Strip the fragment (#section).
6. Normalize: lowercase the host, decode percent-encoded unreserved characters only (decoding reserved ones such as %2F changes which resource the URL names), and sort query params alphabetically.
7. Dedupe against the seen-set.
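
The normalize step is where most of the subtlety sits, so here is a minimal sketch of it using only the standard library. The function names (`normalize_url`, `_normalize_escapes`) are hypothetical. It assumes the URL is already absolute, decodes percent-escapes only for RFC 3986 unreserved characters (so an escaped slash like %2F stays escaped), and sorts the raw `key=value` pairs rather than re-encoding them.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# RFC 3986 "unreserved" characters: decoding these never changes a URL's meaning.
_UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_escapes(text):
    """Decode %XX for unreserved characters; uppercase the hex of all others."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        return char if char in _UNRESERVED else "%" + match.group(1).upper()
    return _PCT.sub(fix, text)


def normalize_url(url):
    """Return a canonical form of an absolute URL (hypothetical helper)."""
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    # Lowercase only host:port; any userinfo before "@" is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = _normalize_escapes(parts.path)
    # Sort raw key=value pairs so the original encoding of values is preserved.
    pairs = sorted(p for p in parts.query.split("&") if p)
    query = _normalize_escapes("&".join(pairs))
    # The empty last field drops the fragment.
    return urlunsplit((parts.scheme, netloc, path, query, ""))
```

Two URLs that differ only in host case, parameter order, or escape style then collapse to one frontier entry, for example `HTTP://Example.COM/%7Euser?b=2&a=1#top` and `http://example.com/~user?a=1&b=2`.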

Edge cases that cause missing pages

- Links in data-href, data-url, or other custom attributes: an extractor that only selects <a href> never sees them.
- Links inside JSON-LD structured data: skipped for the same reason.
- Links built dynamically from React state: only visible after rendering.
- PDF/document URLs in <embed> and <object> tags: easy to skip.

For a thorough crawl, audit one rendered page by hand and compare against your extractor's output.
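
Several of these cases can be picked up from static HTML. The sketch below is one way to do it with the standard library's `html.parser` (a deliberate swap from BeautifulSoup, to keep the example dependency-free). The class and function names are hypothetical, and the attribute list is an assumption: sites choose their own data-* names, so it needs adjusting per target. Links that only exist after JavaScript runs are out of scope here; they need a rendered DOM.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed attribute names; extend per site.
TAG_ATTRS = {"embed": ("src",), "object": ("data",)}
DATA_ATTRS = ("data-href", "data-url")


class ExtraLinkParser(HTMLParser):
    """Collect URLs from custom attributes, <embed>/<object>, and JSON-LD."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()
        self._in_jsonld = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        for name in TAG_ATTRS.get(tag, ()) + DATA_ATTRS:
            value = attrs.get(name)
            if value:  # skips missing and valueless attributes
                self.links.add(urljoin(self.base_url, value.strip()))
        script_type = (attrs.get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self._collect(json.loads("".join(self._chunks)))
            except ValueError:
                pass  # malformed JSON-LD: ignore the block

    def _collect(self, node):
        # Walk the JSON tree and keep string values that look like http(s) URLs.
        if isinstance(node, dict):
            for key, value in node.items():
                if key != "@context":  # vocabulary URL, not a page link
                    self._collect(value)
        elif isinstance(node, list):
            for value in node:
                self._collect(value)
        elif isinstance(node, str) and node.startswith(("http://", "https://")):
            self.links.add(node)


def extract_extra_links(html, base_url):
    parser = ExtraLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The JSON-LD walk keeps every http(s) string it finds, which can include image or logo URLs, so the result is best treated as candidates to filter against your crawl scope before they reach the frontier.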

Code example

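```python
from urllib.parse import urljoin, urldefrag, urlparse, parse_qsl, urlencode
from bs4 import BeautifulSoup

SKIP_SCHEMES = ('javascript:', 'mailto:', 'tel:')

def extract_links(html, base_url):
    soup = BeautifulSoup(html, 'html.parser')
    # A <base href> tag overrides the URL that relative links resolve against
    base = soup.find('base', href=True)
    base_url = urljoin(base_url, base['href']) if base else base_url
    links = set()
    for a in soup.select('a[href]'):
        href = a['href'].strip()
        # Scheme names are case-insensitive, so compare in lowercase
        if not href or href.lower().startswith(SKIP_SCHEMES):
            continue
        # Resolve to an absolute URL, then drop the #fragment
        clean = urldefrag(urljoin(base_url, href)).url
        p = urlparse(clean)
        # keep_blank_values keeps params like "?q=" that parse_qsl drops by default
        sorted_q = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
        # Lowercase the host (assumes the netloc carries no case-sensitive userinfo)
        links.add(p._replace(netloc=p.netloc.lower(), query=sorted_q).geturl())
    return links
```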

Frequently asked questions

Should I normalize trailing slashes?

Yes, but pick one rule and apply it consistently. The trailing-slash and no-trailing-slash forms of a path usually serve the same page; treating them as different URLs means fetching that page twice and wasting crawl budget.
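
As an illustration of "one rule, applied everywhere", this minimal sketch strips trailing slashes from every path except the root. The function name is hypothetical, and the opposite rule (always add a slash) would work just as well as long as it is the only one in use.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_slash(url):
    """Remove trailing slashes from the path, keeping "/" for the site root."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(parts._replace(path=path))
```

Run it on every URL before the seen-set check so both forms map to the same entry. Sites that serve different content for the two forms are the exception, and they need a per-host override.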

Do I extract links from PDFs?

Only if your scope requires it. PDF link extraction is its own problem (pdfplumber or similar) and most crawls do not need it.

What about links in noindex pages?

For SEO crawls, follow them but tag them as "from-noindex", since search engines may not credit those links for ranking. For data crawls, follow them like any other link.
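
One way to apply that tag is to check the page's robots meta tag before handing links to the frontier. This is a sketch under stated assumptions: the names `RobotsMetaParser` and `tag_links` and the "from-indexable" label are hypothetical, only `<meta name="robots">` is inspected, and a noindex sent through the `X-Robots-Tag` HTTP header would need a separate check on the response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Set .noindex when a robots meta tag carries noindex (or none)."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        directives = {d.strip() for d in content.split(",")}
        # "none" is shorthand for "noindex, nofollow".
        if name == "robots" and directives & {"noindex", "none"}:
            self.noindex = True


def tag_links(html, links):
    """Attach a source label to each extracted link (hypothetical helper)."""
    parser = RobotsMetaParser()
    parser.feed(html)
    source = "from-noindex" if parser.noindex else "from-indexable"
    return [{"url": url, "source": source} for url in sorted(links)]
```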

Last updated: 2026-05-26