What Is Link Extraction in Web Crawling?

By the Scrappey Research Team

Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit next. The result is a clean list of full (absolute) URLs with duplicates removed. It sounds trivial, but the awkward cases make it a common cause of "missing pages" bugs in crawlers: relative URLs (shorthand paths like /about that omit the domain), anchors that only jump within the page, links that exist only after JavaScript runs, data-href attributes, and links buried in event handlers.

Source	<a href> primarily; also <link>, <area>, <iframe src>
Steps	Parse → resolve relative URLs → strip fragments → normalize → dedupe
Easy to miss	JS-rendered links, data-* attributes, onclick handlers
Normalize	Lowercase host, sort query params, strip trailing slashes consistently
Output	Set of absolute, normalized URLs for the frontier

The basic algorithm

Parse the HTML (turn the raw text into a structure you can search). Select every <a> tag that has an href attribute. Resolve each href against the page's base URL: combine a relative path like /about with the page's address to get a full URL, and respect the <base> tag if the page has one. Strip the fragment (the #section part, which only scrolls within a page). Drop javascript:, mailto:, tel:, and empty hrefs, since none of those are pages to crawl. Normalize so equivalent URLs look identical: lowercase the scheme and host, make percent-encoding consistent (uppercase the hex digits and decode only escapes for unreserved characters, so %7E becomes ~ while %20 and %2F stay encoded), and sort query parameters alphabetically. Finally, dedupe against the seen-set, the running collection of URLs you have already gathered.
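The normalization step is the part that is easiest to get subtly wrong, so here is a minimal sketch of it using only the Python standard library. The names normalize_url and _fix_escapes are hypothetical helpers introduced for illustration. The sketch assumes URLs without userinfo or IPv6 hosts, drops default ports, and decodes only escapes for unreserved characters (letters, digits, and - . _ ~) in the path; treat it as one reasonable set of rules, not the only correct one.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)
DEFAULT_PORTS = {"http": 80, "https": 443}


def _fix_escapes(text):
    """Decode %XX only for unreserved characters; uppercase the rest."""
    def repl(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else "%" + match.group(1).upper()
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalize_url(url):
    """Hypothetical helper: return a canonical form of an absolute URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # assumption: no userinfo, no IPv6
    port = parts.port
    # Drop the port when it is absent or the default for the scheme.
    if port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = _fix_escapes(parts.path) or "/"  # empty path is equivalent to "/"
    # keep_blank_values=True so parameters like "a=" are not dropped.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((scheme, netloc, path, query, ""))  # "" strips fragment


seen = set()
for raw in ["HTTPS://Example.COM:443/a%7Eb?b=2&a=1#top",
            "https://example.com/a~b?a=1&b=2"]:
    seen.add(normalize_url(raw))
```

Both inputs in the loop collapse to the same entry in seen, which is the whole point of normalizing before the dedupe step.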

JS-rendered links

Modern sites load many links via JavaScript: lazy-rendered cards, "load more" buttons, infinite scroll. A static HTML parser, one that only reads the originally downloaded HTML and never runs scripts, misses them entirely. You have two options: render the page in a real browser before extracting, so the JavaScript runs and the links appear, or find the underlying XHR or fetch endpoint (the background request the page makes to fetch its data) and crawl from its JSON response directly. Both are valid; the endpoint route is usually cheaper if the endpoint is accessible.
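For the endpoint route, the extraction step becomes a walk over parsed JSON rather than HTML. The sketch below assumes the response stores links under keys named url, href, or link; those key names, the function name links_from_json, and the sample payload are all assumptions for illustration, since every site's JSON is shaped differently.

```python
import json
from urllib.parse import urljoin, urlsplit


def links_from_json(payload, base_url, keys=("url", "href", "link")):
    """Hypothetical helper: collect absolute http(s) URLs from parsed JSON."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in keys and isinstance(value, str):
                    absolute = urljoin(base_url, value)
                    if urlsplit(absolute).scheme in ("http", "https"):
                        found.add(absolute)
                else:
                    walk(value)  # descend into nested objects and arrays
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(payload)
    return found


# Made-up payload standing in for what a listing endpoint might return.
sample = json.loads(
    '{"items": [{"title": "A", "url": "/products/1"},'
    ' {"title": "B", "url": "/products/2"}],'
    ' "next": {"href": "?page=2"}}'
)
links = links_from_json(sample, "https://shop.example/api/list")
```

Relative values are resolved against the endpoint's own URL here; if the site expects them to resolve against the page URL instead, pass that as base_url.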

Edge cases that cause missing pages

Links in data-href, data-url, or other custom attributes are missed because most extractors only look at the standard href. Links inside JSON-LD structured data (machine-readable metadata embedded in the page) are missed for the same reason. Links built dynamically from React state are only visible after the page renders. PDF and document URLs in <embed> and <object> tags are easy to skip. For a thorough crawl, audit one rendered page by hand and compare it against your extractor's output to see what it is dropping.
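A rough sketch of a second pass that picks up some of these cases is shown here. It uses the standard library's html.parser rather than BeautifulSoup so that it runs without extra packages. The class and function names are hypothetical, the attribute list is only the two custom attributes named above, and the JSON-LD handling assumes links live under a key called url. Links built from React state are out of reach for any static pass like this one.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class ExtraLinkParser(HTMLParser):
    """Hypothetical second-pass parser for links outside <a href>."""
    ATTRS = {"data-href", "data-url"}                  # custom attributes
    TAG_ATTRS = {("embed", "src"), ("object", "data")}  # document embeds

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()
        self._in_jsonld = False
        self._jsonld = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        script_type = (attrs.get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._jsonld = []
        for name, value in attrs.items():
            if value and (name in self.ATTRS or (tag, name) in self.TAG_ATTRS):
                self.links.add(urljoin(self.base_url, value.strip()))

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self._collect(json.loads("".join(self._jsonld)))
            except ValueError:
                pass  # malformed JSON-LD: skip it rather than fail the page

    def _collect(self, node):
        # Assumption: JSON-LD links of interest are stored under "url".
        if isinstance(node, dict):
            for key, value in node.items():
                if key == "url" and isinstance(value, str):
                    self.links.add(urljoin(self.base_url, value))
                else:
                    self._collect(value)
        elif isinstance(node, list):
            for item in node:
                self._collect(item)


def extract_extra_links(html, base_url):
    parser = ExtraLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

For the hand audit, a set difference between the links you collect from the rendered page and the extractor's output shows exactly which URLs are being dropped.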

Code example

python

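from urllib.parse import urljoin, urldefrag, urlparse, parse_qsl, urlencode
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return the set of absolute, normalized http(s) URLs linked from html."""
    soup = BeautifulSoup(html, 'html.parser')
    # Respect <base href> if present; it changes how relative links resolve.
    base = soup.find('base', href=True)
    base_url = urljoin(base_url, base['href']) if base else base_url
    links = set()
    for a in soup.select('a[href]'):
        href = a['href'].strip()
        if not href:
            continue
        # Resolve against the base, then drop the #fragment.
        clean = urldefrag(urljoin(base_url, href)).url
        p = urlparse(clean)
        # Keep only crawlable schemes. This drops javascript:, mailto:, and
        # tel: regardless of how the scheme is capitalized in the page.
        if p.scheme not in ('http', 'https'):
            continue
        # keep_blank_values=True so "?a=&b=1" does not silently lose "a".
        sorted_q = urlencode(sorted(parse_qsl(p.query, keep_blank_values=True)))
        # Lowercase the host so Example.com and example.com dedupe together.
        links.add(p._replace(netloc=p.netloc.lower(), query=sorted_q).geturl())
    return links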

Related terms

How to Get All Links From a Webpage

Getting all links from a webpage means downloading the page, reading every <a href> attribute (the URL inside each link tag), turning relati…

What Is a Web Crawler?

A web crawler is a program that finds and downloads web pages on its own by following links - it starts from a few given pages (called seed …

What Is a Sitemap?

A sitemap is an XML (or sometimes plain-text) file that lists a site's canonical URLs along with optional metadata: last-modified date, chan…

What Is Crawl Depth Limit?

Crawl depth limit is the maximum number of link hops a crawler will follow from a seed URL. A "hop" is one click along a link. The page you …

What Is a 404 Error?

HTTP 404 Not Found is the server's way of saying "I understood your request, but there is nothing at this address." The server is working fi…

Best Web Scraping API for SEO Audits

The best web scraping API for SEO audits combines reliable SERP scraping (Google, Bing, regional engines) with on-page extraction — title, m…

List Crawling in Web Scraping

List crawling is the technique of crawling paginated list, category, or index pages to enumerate the URLs of individual items, then fetching…

Frequently asked questions

Should I normalize trailing slashes?

Yes, but pick one rule and apply it consistently. A path with a trailing slash and the same path without one usually point to the exact same page; if you treat them as two different URLs, you crawl that page twice and waste crawl budget for nothing.
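As an illustration, here is a minimal sketch of one such rule: strip trailing slashes everywhere except the root path. The function name strip_trailing_slash is hypothetical, and the opposite rule (always add a slash) would work equally well as long as it is applied everywhere.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_trailing_slash(url):
    """Hypothetical helper: remove trailing slashes, keeping the root "/"."""
    parts = urlsplit(url)
    path = parts.path
    if len(path) > 1 and path.endswith("/"):
        path = path.rstrip("/") or "/"
    return urlunsplit(parts._replace(path=path))
```

Because the two forms only usually point to the same page, it is worth spot-checking a site before applying the rule to a large crawl.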

Do I extract links from PDFs?

Only if your scope requires it. Pulling links out of PDFs is a separate problem with its own tools (pdfplumber or similar), and most crawls never need it.

What about links in noindex pages?

A noindex page is one that asks search engines not to list it in results. For SEO crawls, still follow its links but tag them as "from-noindex", because search engines may give those links little or no ranking credit. For data crawls, just follow them like any other link.
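One way to do the tagging is sketched below with the standard library's html.parser. It only looks at the robots meta tag in the HTML, so a noindex sent through the X-Robots-Tag response header is not covered. The names tag_links and _RobotsAndLinks, and the "from-indexable" label for ordinary pages, are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _RobotsAndLinks(HTMLParser):
    """Collect <a href> values and note whether the page declares noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            directives = [d.strip().lower() for d in content.split(",")]
            # "none" is shorthand for "noindex, nofollow".
            if "noindex" in directives or "none" in directives:
                self.noindex = True
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def tag_links(html, base_url):
    """Hypothetical helper: map each absolute link to a source label."""
    parser = _RobotsAndLinks()
    parser.feed(html)
    parser.close()
    source = "from-noindex" if parser.noindex else "from-indexable"
    return {urljoin(base_url, href): source for href in parser.hrefs}
```

The label travels with each URL into the frontier, so later reporting can separate links discovered on noindex pages from the rest.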

Last updated: 2026-05-31