The basic algorithm
Parse the HTML. Select every <a> with an href attribute. Resolve each href against the page's base URL (handle <base> tag if present). Strip the fragment (#section). Drop javascript:, mailto:, tel:, and empty hrefs. Normalize: lowercase the host, decode percent-encoding, sort query params alphabetically. Dedupe against the seen-set.
JS-rendered links
Modern sites load many links via JavaScript: lazy-rendered cards, "load more" buttons, infinite scroll. A static HTML parser misses them entirely. Either render the page in a browser before extracting, or identify the underlying XHR endpoint and crawl from its JSON response directly. Both are valid; the XHR path is usually cheaper if the endpoint is accessible.
Edge cases that cause missing pages
Links in data-href, data-url, or other custom attributes — most parsers ignore them. Links inside JSON-LD structured data — same. Links built dynamically from React state — only visible after rendering. PDF/document URLs in <embed> and <object> tags — easy to skip. For a thorough crawl, audit one rendered page by hand and compare against your extractor's output.
