How to Get All Links From a Webpage

Getting all links from a webpage means fetching the page, reading the href attribute of every <a> element, resolving relative URLs against the base URL, normalizing away fragments and query-parameter order, and deduplicating the result. For static pages this takes a few lines of code. For JS-rendered pages you need to render the page in a browser first so its scripts can run. For crawling pipelines you also want to filter by domain, by URL pattern, or by rel attributes (nofollow, ugc).

Quick facts

Static pages: BeautifulSoup or cheerio, plus urljoin for relative URLs
JS-rendered: Playwright or another headless browser, then querySelectorAll('a[href]')
Always do: Resolve relatives, strip fragments, lowercase host, dedupe
Filter on: Same domain, URL pattern, rel attribute, link text
Watch for: Links in onclick handlers, data-href, JS-only navigation

The basic pattern (static HTML)

Fetch the page, parse the HTML, iterate over every <a> with an href attribute, and resolve each href against the document's base URL (the final URL after redirects, or the <base href> value if the page sets one). Strip the fragment (everything after #) unless you specifically care about anchored links. Normalize the host to lowercase and the path to a canonical form. Drop empty hrefs and non-navigational schemes such as javascript:, mailto:, and tel:. Dedupe.
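The normalization steps above can be sketched with the standard library alone. This is a minimal illustration, not a canonical algorithm: the function name normalize_link is hypothetical, and sorting query parameters assumes the target site does not treat parameter order as significant.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_link(base_url, href):
    """Resolve href against base_url and return a canonical URL, or None to skip it."""
    href = href.strip()
    if not href:
        return None
    parts = urlsplit(urljoin(base_url, href))
    # Anything that is not http(s) (javascript:, mailto:, tel:, ...) is dropped
    if parts.scheme not in ("http", "https"):
        return None
    # Assumption: no case-sensitive userinfo in the URL, so lowercasing netloc is safe
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Assumption: parameter order is not significant, so sort it for deduplication
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Empty last element drops the #fragment
    return urlunsplit((parts.scheme, netloc, path, query, ""))
```

Collecting the non-None results in a set then handles deduplication.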

When you need a real browser

Modern SPAs and infinite-scroll feeds add links to the DOM after the initial HTML loads. A static fetch misses them. Use Playwright (or a JS-rendering scraping API), wait for the page to settle, then run document.querySelectorAll('a[href]') in the browser context. For infinite scroll, scroll to the bottom in steps and collect links after each scroll until no new links appear.
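A rough sketch of the rendered-page approach with Playwright's synchronous Python API follows. It assumes Playwright and a Chromium build are installed, and it uses the "networkidle" load state as one heuristic for "settled", which may not suit pages that poll continuously. The function names are illustrative.

```python
from urllib.parse import urldefrag

def clean_hrefs(hrefs):
    """Keep http(s) URLs, strip fragments, dedupe, and sort."""
    out = set()
    for h in hrefs:
        # Non-string values (e.g. from SVG <a> elements) are ignored
        if isinstance(h, str) and h.startswith(("http://", "https://")):
            out.add(urldefrag(h).url)
    return sorted(out)

def get_rendered_links(url, timeout_ms=30_000):
    # Imported lazily so clean_hrefs stays usable without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Heuristic: wait until network activity quiets down
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            # In the browser, a.href is already resolved to an absolute URL
            hrefs = page.eval_on_selector_all(
                "a[href]", "els => els.map(a => a.href)"
            )
        finally:
            browser.close()
    return clean_hrefs(hrefs)
```

Because the browser resolves relative URLs itself, no urljoin step is needed on the Python side.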

Filtering for crawl pipelines

For a focused crawler, filter aggressively: keep same-domain links only (or use a domain allowlist), keep URL patterns that match content paths (skip /login, /cart, and asset paths), and skip rel="nofollow" links if you want to honor the crawled site's own signal about which links it endorses. For SEO link extraction, keep the rel attributes as metadata rather than filtering on them.
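One way to express these filters is a single predicate applied to each extracted link. The sketch below is illustrative: the keep_link name, the SKIP_PATH pattern, and the respect_nofollow flag are assumptions chosen for the example, and a real crawler would tune the pattern to the target site.

```python
import re
from urllib.parse import urlsplit

# Hypothetical skip rules: account/cart/asset path prefixes and common asset extensions
SKIP_PATH = re.compile(
    r"^/(login|cart|static|assets)(/|$)|\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$",
    re.IGNORECASE,
)

def keep_link(url, allowed_hosts, rel=(), respect_nofollow=False):
    """Return True if url passes the domain, path, and rel filters."""
    parts = urlsplit(url)
    # urlsplit().hostname is already lowercased; None for URLs without a host
    if (parts.hostname or "") not in allowed_hosts:
        return False
    if SKIP_PATH.search(parts.path):
        return False
    if respect_nofollow and "nofollow" in {r.lower() for r in rel}:
        return False
    return True
```

For example, keep_link("https://example.com/blog/post-1", {"example.com"}) passes, while the same host with a /login path or a .js asset is rejected.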

Code example

python
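from urllib.parse import urljoin, urldefrag
import requests
from bs4 import BeautifulSoup

def get_links(url):
    """Return the sorted, deduplicated absolute links found on a static page."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(r.text, 'html.parser')
    # Honor <base href> if the page sets one; otherwise use the final URL after redirects
    base_tag = soup.find('base', href=True)
    base = urljoin(r.url, base_tag['href']) if base_tag else r.url
    links = set()
    for a in soup.select('a[href]'):
        href = a['href'].strip()
        # Skip empty hrefs and non-navigational schemes
        if not href or href.lower().startswith(('javascript:', 'mailto:', 'tel:')):
            continue
        links.add(urldefrag(urljoin(base, href)).url)  # resolve, then strip the #fragment
    return sorted(links)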

Frequently asked questions

Why are some links missing from my extraction?

Likely they are added by JavaScript after the page loads. Switch to a headless browser or JS-rendering API, then wait for the DOM to settle before reading.

Should I follow rel="nofollow" links?

For crawling, generally yes: nofollow is a hint to search engines about passing ranking credit (PageRank), not an access restriction. For SEO analysis, surface the attribute as metadata rather than filtering it out.
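To keep rel as metadata, each link can be stored as a small record instead of a bare URL. The sketch below uses the standard library's html.parser rather than BeautifulSoup so that it has no dependencies. The LinkCollector class and the record shape are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect {'url': ..., 'rel': [...]} records for every <a href>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = (attr.get("href") or "").strip()
        if not href:
            return
        # rel is a space-separated token list, e.g. "nofollow ugc"
        rel = (attr.get("rel") or "").lower().split()
        self.links.append({"url": urljoin(self.base_url, href), "rel": rel})
```

Feed it the page HTML with collector.feed(html) and read collector.links afterwards. A crawler can then decide per record whether to follow, while an SEO report can list the rel tokens as-is.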

How do I handle infinite scroll?

Scroll programmatically in a loop, collecting links after each scroll, until two consecutive iterations return the same set. Cap the iteration count to avoid runaway loops on truly infinite feeds.
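The loop itself is independent of the browser library, so it can be sketched with two callables. Here read_links and scroll_once are hypothetical hooks supplied by the caller, and "stable" is interpreted as two consecutive scrolls that add no new links.

```python
def collect_until_stable(read_links, scroll_once, max_rounds=50):
    """Scroll and collect until two consecutive rounds add nothing, or the cap is hit."""
    seen = set(read_links())
    unchanged = 0
    for _ in range(max_rounds):  # hard cap for truly infinite feeds
        scroll_once()
        before = len(seen)
        seen.update(read_links())
        unchanged = unchanged + 1 if len(seen) == before else 0
        if unchanged >= 2:
            break
    return seen
```

With Playwright, read_links could wrap page.eval_on_selector_all("a[href]", ...) and scroll_once could call page.mouse.wheel followed by a short page.wait_for_timeout. Those wiring details are assumptions about your setup.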

Last updated: 2026-05-26