The basic pattern (static HTML)
Fetch the page, parse the HTML, and iterate over every <a> element that has an href attribute. Resolve each href against the document's base URL, which is the page URL unless the document declares a <base href>. Strip the fragment (everything after #) unless you specifically care about anchored links. Normalize the scheme and host to lowercase and the path to a canonical form (for example, treat an empty path as /). Drop empty hrefs and non-HTTP schemes such as javascript: and mailto:. Deduplicate the result.
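The steps above can be sketched with the Python standard library alone. This is a minimal illustration, not a hardened implementation: the names extract_links and normalize are made up for this example, the HTML is assumed to be already fetched, and the "canonical form" for paths is reduced to mapping an empty path to /.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit


class _LinkParser(HTMLParser):
    """Collects raw href values from <a> tags and the first <base href>."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def normalize(url):
    """Strip the fragment, lowercase scheme and host, map an empty path to /."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    # Assumption: URLs carry no user:password@ prefix, so lowercasing the
    # whole netloc only affects the host.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def extract_links(html, page_url):
    """Return deduplicated absolute http(s) links in document order."""
    parser = _LinkParser()
    parser.feed(html)
    # A <base href> may itself be relative, so resolve it against the page URL.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url

    seen, links = set(), []
    for href in parser.hrefs:
        href = href.strip()
        if not href:
            continue
        absolute = urljoin(base, href)
        # Drops javascript:, mailto:, and any other non-HTTP scheme.
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue
        url = normalize(absolute)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Calling extract_links(html, "https://example.com/dir/page") on a page containing /a#x and /a yields a single entry for https://example.com/a, because both resolve to the same URL once the fragment is stripped.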
When you need a real browser
Modern SPAs and infinite-scroll feeds add links to the DOM after the initial HTML loads, so a static fetch misses them. Use Playwright (or a JS-rendering scraping API), wait for the page to settle (for example, until network activity goes idle or a known element appears), then run document.querySelectorAll('a[href]') in the browser context. For infinite scroll, scroll to the bottom in steps and collect links after each scroll until no new links appear.
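A rough sketch of the scroll-and-collect loop using Playwright's synchronous Python API follows. The function names, the settle_ms pause, and the max_rounds cap are assumptions made for this example. Waiting for "networkidle" is one way to approximate a settled page; waiting for a specific selector is often more dependable on pages that poll in the background. The stopping logic is kept separate from the browser code so it can be exercised without launching a browser.

```python
def collect_until_stable(get_links, scroll, max_rounds=50):
    """Scroll repeatedly, stopping when a round yields no unseen links.

    get_links: callable returning an iterable of URLs currently in the DOM.
    scroll:    callable that scrolls and waits for new content to load.
    max_rounds is a safety cap (an assumption) for feeds that never end.
    """
    seen = set(get_links())
    for _ in range(max_rounds):
        scroll()
        current = set(get_links())
        if current <= seen:  # nothing new appeared after this scroll
            break
        seen |= current
    return seen


def rendered_links(url, settle_ms=1000):
    """Load url in headless Chromium and return the set of links found."""
    # Imported here so the helper above stays usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")

            def get_links():
                # e.href is the absolute URL as resolved by the browser.
                return page.eval_on_selector_all(
                    "a[href]", "els => els.map(e => e.href)"
                )

            def scroll():
                page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
                page.wait_for_timeout(settle_ms)  # crude pause for lazy content

            return collect_until_stable(get_links, scroll)
        finally:
            browser.close()
```

Reading e.href rather than the raw attribute lets the browser do the base-URL resolution, though fragments, non-HTTP schemes, and duplicates across near-identical URLs still need the same cleanup as in the static case.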
Filtering for crawl pipelines
For a focused crawler, filter aggressively: keep same-domain links only (or use a domain allowlist), keep URL patterns that match content paths (skip /login, /cart, and asset paths), and honor rel="nofollow" if you want to respect the crawled site's own signal about which links to follow. For SEO link extraction, keep the rel attributes as metadata rather than filtering on them.
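One way to express these filters is a single predicate applied to each extracted link. Everything specific below is a placeholder: the example.com allowlist, the skip pattern, and the should_crawl name are assumptions for illustration, and a real crawler would tune the patterns to the target site.

```python
import re
from urllib.parse import urlsplit

# Hypothetical skip rules: login/cart pages and common static-asset extensions.
SKIP_PATH = re.compile(
    r"^/(login|cart)(/|$)|\.(css|js|png|jpe?g|gif|svg|ico|woff2?)$",
    re.IGNORECASE,
)


def should_crawl(url, rel="", allowed_hosts=frozenset({"example.com"}),
                 honor_nofollow=True):
    """Return True if the link passes the allowlist, path, and rel filters.

    rel is the raw rel attribute of the <a> tag ("" if absent).
    """
    parts = urlsplit(url)
    if parts.hostname not in allowed_hosts:
        return False
    if SKIP_PATH.search(parts.path):
        return False
    # rel is a space-separated token list, e.g. "nofollow noopener".
    if honor_nofollow and "nofollow" in rel.lower().split():
        return False
    return True
```

For SEO link extraction, the same function can be called with honor_nofollow=False while the rel value is stored alongside each URL, so the attribute survives as metadata instead of acting as a filter.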
