How to Get All Links From a Webpage

By the Scrappey Research Team

Getting all links from a webpage means downloading the page, reading every <a href> attribute (the URL inside each link tag), resolving relative URLs into absolute ones, normalizing them (dropping fragments and putting query-string parameters in a consistent order), and removing duplicates. On a static page this takes only a few lines of code; on a JavaScript-rendered page you must run the page's scripts first; and in a crawling pipeline you also want to filter by domain, by URL pattern, or by rel attributes such as nofollow and ugc.

Static pages	BeautifulSoup or cheerio + urljoin for relatives
JS-rendered	Playwright/headless browser, then querySelectorAll('a[href]')
Always do	Resolve relatives, strip fragments, lowercase host, dedupe
Filter on	Same domain, URL pattern, rel attribute, link text
Watch for	Links in onclick handlers, data-href, JS-only navigation

The basic pattern (static HTML)

The plan is straightforward: download the page, parse the HTML, walk through every <a> tag that has an href, and turn each href into a full URL by resolving it against the document's base URL (a relative link like /page is just shorthand for the complete address). Strip the fragment - everything after the # - unless you specifically care about anchored links. Normalize the host to lowercase and put the path in a canonical (consistent) form. Drop the junk: empty hrefs, javascript: pseudo-links, and mailto: or tel: addresses. Finally, dedupe so the same URL is not listed twice.
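The normalization steps above can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: the names normalize_link, unique_links, and SKIP_SCHEMES are hypothetical, and sorting query parameters is an assumption that only holds on sites where parameter order does not change the page.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# Pseudo-link schemes that never lead to a crawlable page (illustrative list).
SKIP_SCHEMES = ("javascript:", "mailto:", "tel:")


def normalize_link(base_url, href):
    """Return a canonical absolute URL for href, or None if it should be dropped."""
    href = href.strip()
    if not href or href.lower().startswith(SKIP_SCHEMES):
        return None
    # Resolve relative links such as "/page" or "../x" against the base URL.
    parts = urlsplit(urljoin(base_url, href))
    if parts.scheme not in ("http", "https"):
        return None
    # Assumption: parameter order is irrelevant, so ?b=2&a=1 equals ?a=1&b=2.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercase the host (this would also lowercase any user:pass@ prefix),
    # treat an empty path as "/", and drop the fragment by passing "".
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path or "/", query, ""))


def unique_links(base_url, hrefs):
    """Normalize every href and return the de-duplicated, sorted result."""
    seen = {normalize_link(base_url, h) for h in hrefs}
    seen.discard(None)
    return sorted(seen)
```

For example, unique_links("https://example.com/a/", ["b", "b#x", "/a/b"]) collapses all three hrefs into the single URL https://example.com/a/b.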

When you need a real browser

Modern SPAs (single-page apps - sites that build the page in your browser with JavaScript) and infinite-scroll feeds add links to the DOM (the live, in-memory version of the page) only after the initial HTML loads. A plain static fetch never runs that JavaScript, so it misses those links. Use Playwright (or a JS-rendering scraping API), wait for the page to settle, then run document.querySelectorAll('a[href]') in the browser context to read the finished page. For infinite scroll, scroll to the bottom in steps and collect links after each scroll until no new ones appear.
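A rough sketch of the rendered-page approach, assuming Playwright for Python is installed (pip install playwright, then playwright install chromium). The function names are hypothetical, and waiting for "networkidle" is only one heuristic for "the page has settled"; some sites need an explicit wait for a specific element instead.

```python
def links_from_rendered_page(page):
    """Read every a[href] from an already-loaded Playwright page.

    The DOM property a.href is resolved by the browser, so the values
    come back as absolute URLs even when the markup used relative ones.
    """
    hrefs = page.eval_on_selector_all("a[href]", "els => els.map(a => a.href)")
    return sorted(set(hrefs))


def fetch_rendered_links(url):
    """Open url in headless Chromium, wait for it to settle, return its links."""
    # Imported lazily so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Assumption: network idle is a good enough "settled" signal here.
            page.goto(url, wait_until="networkidle")
            return links_from_rendered_page(page)
        finally:
            browser.close()
```

Splitting the DOM read from the browser setup keeps the extraction step reusable, for instance inside a scroll loop on an infinite-scroll feed.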

Filtering for crawl pipelines

For a focused crawler you usually want fewer links, not more, so filter aggressively: keep same-domain links only (or use a domain allowlist), keep URL patterns that match real content paths (skip /login, /cart, and asset paths), and respect rel="nofollow" if you want to honor the site's own signal about which links it vouches for. rel="nofollow" is a hint a site adds to a link to say "do not pass ranking credit through here." For SEO link extraction, keep the rel attributes as metadata rather than filtering on them.
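One way to express these filters, as a sketch: the skip pattern and the name keep_link below are illustrative assumptions, and a real crawler would tune the pattern to the site it targets. Host matching here is exact, so subdomains must be listed explicitly in the allowlist.

```python
import re
from urllib.parse import urlsplit

# Hypothetical skip rules: account/cart pages and common static-asset paths.
SKIP_PATH = re.compile(
    r"^/(login|cart|static|assets)(/|$)"
    r"|\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$",
    re.IGNORECASE,
)


def keep_link(url, allowed_hosts):
    """Return True if url is on an allowed host and is not a skipped path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname excludes port and userinfo
    if host not in allowed_hosts:
        return False
    return not SKIP_PATH.search(parts.path)
```

A typical use would be [u for u in links if keep_link(u, {"example.com", "www.example.com"})].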

Code example

python

from urllib.parse import urljoin, urldefrag
import requests
from bs4 import BeautifulSoup

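def get_links(url):
    """Return the sorted, de-duplicated absolute links found on a static page."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(r.text, 'html.parser')
    links = set()
    for a in soup.select('a[href]'):
        href = a['href'].strip()
        # Skip empty hrefs and pseudo-links that are not real pages.
        if not href or href.lower().startswith(('javascript:', 'mailto:', 'tel:')):
            continue
        # Resolve against the final URL (after redirects), then drop the #fragment.
        links.add(urldefrag(urljoin(r.url, href)).url)
    return sorted(links)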

Related terms

What Is Link Extraction?

Link extraction is the crawling step where you pull every URL out of a page you have just downloaded, so you can decide which ones to visit …

What Is a Web Scraping API?

A web scraping API is a hosted HTTP service that visits a web page for you and hands back the result — rendered HTML, JSON, or already-parse…

How to Scrape Infinite-Scroll Pages

Infinite scroll is the page design where new content keeps loading on its own as you scroll down (like a social feed that never ends). To sc…

What Is a Headless Browser?

A headless browser is a real web browser — Chrome, Firefox, or WebKit — that runs without a visible window, driven entirely by code instead …

What Is mitmproxy?

mitmproxy is a free tool that sits between an app and the internet so you can read and change the HTTPS traffic passing through it. The name…

What Is Web Scraping as a Service?

Web scraping as a service (WSaaS) is a managed, cloud-based offering that handles web data extraction for you through an API or dashboard - …

What Is a CSS Selector?

A CSS selector is a pattern that picks out specific elements in an HTML document by matching their tag, class, id, attributes, or position. …

What Is an XPath Selector?

XPath (XML Path Language) is a query language for navigating the tree structure of an HTML or XML document to select elements by their path,…

What Are Regular Expressions (Regex)?

A regular expression (regex) is a compact pattern that describes a set of strings, used to find, match, and extract text. The pattern \d{3}-…

Frequently asked questions

Why are some links missing from my extraction?

Most likely those links are added by JavaScript after the page first loads, so a plain static fetch never sees them. Switch to a headless browser (a real browser engine running with no window) or a JS-rendering API, and wait for the DOM to settle before reading the links.

Should I follow rel="nofollow" links?

For crawling, usually yes - nofollow is a hint to search engines about PageRank (how ranking credit flows between pages), not a rule that blocks access, though a polite crawler may still choose to honor it. For SEO analysis, surface the attribute as metadata rather than filtering those links out.
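Keeping rel as metadata might look like the sketch below. It uses only the standard library's html.parser; the class and function names are hypothetical, and the output shape (a list of dicts) is just one convenient choice.

```python
from html.parser import HTMLParser


class LinkRelParser(HTMLParser):
    """Collect each <a href> together with its rel tokens."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:  # no href, or a valueless href attribute
            return
        # rel is a space-separated token list, e.g. "nofollow ugc".
        rel = set((attr.get("rel") or "").lower().split())
        self.links.append(
            {"href": href, "rel": sorted(rel), "nofollow": "nofollow" in rel}
        )


def links_with_rel(html):
    """Return every link in html with its rel tokens kept as metadata."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links
```

A crawler can then decide per link whether to follow it, while an SEO report can list the rel values unchanged.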

How do I handle infinite scroll?

Scroll the page programmatically in a loop, collecting links after each scroll, until two scrolls in a row return the same set of links (meaning nothing new loaded). Cap the number of iterations so you do not get stuck on a feed that scrolls forever.
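The loop described above can be sketched independently of any browser library. Here read_links and scroll are hypothetical callables supplied by the caller, and max_scrolls and stable_rounds are illustrative defaults rather than recommended values.

```python
def scroll_and_collect(read_links, scroll, max_scrolls=30, stable_rounds=2):
    """Collect links while scrolling, stopping once nothing new appears.

    read_links() returns the links currently in the page; scroll() moves
    the page down one step. The loop ends after stable_rounds consecutive
    reads add no new link, or after max_scrolls reads, whichever is first.
    """
    seen = set()
    unchanged = 0
    for _ in range(max_scrolls):
        before = len(seen)
        seen.update(read_links())
        unchanged = unchanged + 1 if len(seen) == before else 0
        if unchanged >= stable_rounds:
            break
        scroll()
    return sorted(seen)


# With a Playwright page (assumed to be already open), the callables could be:
#   read_links = lambda: page.eval_on_selector_all(
#       "a[href]", "els => els.map(a => a.href)")
#   def scroll():
#       page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
#       page.wait_for_timeout(1000)  # give new content time to load
```

Passing the browser actions in as callables keeps the stopping logic easy to test without launching a browser.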

Last updated: 2026-05-31