Web Scraping With Java

By the Scrappey Research Team

Web scraping with Java means fetching a web page over HTTP and extracting structured data from its HTML, usually with Jsoup for static pages and Selenium or Playwright for JavaScript-rendered ones. Java is a strong choice for production scrapers: it is fast, strongly typed, and has first-class concurrency (virtual threads since Java 21), which matters when you are fetching thousands of pages. The standard stack in 2026 is the built-in HttpClient plus Jsoup for parsing.

Static parsing	Jsoup 1.22.x — fetch + parse + CSS selectors in one library
JavaScript pages	Selenium 4.x (auto driver) or HtmlUnit / Playwright for Java
HTTP client	java.net.http.HttpClient (built into Java 11+; virtual threads in 21+)
Concurrency	ExecutorService / virtual threads for parallel fetching
Build tool	Maven or Gradle dependency on org.jsoup:jsoup

Your first Java scraper with Jsoup

Jsoup is the workhorse of Java scraping: it fetches a page, parses the HTML into a tree, and lets you select elements with CSS selectors — all in one dependency. Add it with Maven (org.jsoup:jsoup:1.22.2) or Gradle, then:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BookScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://books.toscrape.com/")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .timeout(10_000)
                .get();

        Elements books = doc.select("article.product_pod");
        for (Element book : books) {
            String title = book.selectFirst("h3 > a").attr("title");
            String price = book.selectFirst(".price_color").text();
            System.out.println(title + " | " + price);
        }
    }
}

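Jsoup.connect(url).get() performs the HTTP request and returns a parsed Document. From there, select() takes any CSS selector and returns every match, while selectFirst() returns the first matching element, or null when nothing matches, so check for null on pages whose markup varies. .text() reads an element's text content; .attr("title") or .attr("href") reads an attribute. Always set a realistic, current userAgent rather than relying on the library default, which is a fixed string that does not track browser releases and is easy to flag.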

Following pagination and crawling

Most real jobs span many pages. Jsoup makes it easy to read the "next" link and follow it. Resolve relative URLs with absUrl() so links work no matter how they are written in the HTML:

String url = "https://books.toscrape.com/";
while (url != null) {
    Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();

    for (Element book : doc.select("article.product_pod")) {
        System.out.println(book.selectFirst("h3 > a").attr("title"));
    }

    Element next = doc.selectFirst("li.next > a");
    url = (next != null) ? next.absUrl("href") : null;   // null stops the loop
}
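
To see why resolving matters, here is a minimal sketch of the same kind of relative-URL resolution using only the JDK's java.net.URI. The RelativeLinks class and its resolve helper are hypothetical names for illustration, and the href values are examples of how a "next" link might be written, not a claim about any site's exact markup. Jsoup's absUrl() does this resolution for you against the document's base URI.

```java
import java.net.URI;

// Sketch: how a relative href turns into an absolute URL, using only the JDK.
final class RelativeLinks {

    /** Resolves an href found on pageUrl into an absolute URL string. */
    static String resolve(String pageUrl, String href) {
        // URI.create throws IllegalArgumentException on malformed input (for example, raw spaces).
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // The same kind of relative link resolves differently depending on the page it sits on.
        System.out.println(resolve("https://books.toscrape.com/", "catalogue/page-2.html"));
        System.out.println(resolve("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"));
        System.out.println(resolve("https://books.toscrape.com/catalogue/page-2.html", "../index.html"));
    }
}
```

Because the result depends on the page the link was found on, following the raw href value in a crawl loop breaks as soon as the crawler leaves the start page.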

For large crawls, fetch pages in parallel. Java 21 virtual threads make this cheap: one Executors.newVirtualThreadPerTaskExecutor() can run thousands of concurrent blocking fetches without dedicating an OS thread to each. Throttle politely, for example by capping the number of requests in flight, so you do not hammer the target.
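
One way to combine virtual threads with a concurrency cap is sketched below. It assumes Java 21 or later. ParallelFetcher, fetchAll, and the maxInFlight parameter are hypothetical names, and the fetcher is passed in as a function so the sketch runs offline with a stand-in. In a real crawl it would wrap a Jsoup or HttpClient call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Sketch: one virtual thread per URL, with a semaphore capping in-flight requests.
final class ParallelFetcher {

    /**
     * Applies fetcher to every URL and returns the results in input order.
     * maxInFlight limits how many fetches run at the same time.
     */
    static List<String> fetchAll(List<String> urls,
                                 Function<String, String> fetcher,
                                 int maxInFlight)
            throws InterruptedException, ExecutionException {
        Semaphore permits = new Semaphore(maxInFlight);
        // ExecutorService is AutoCloseable; close() waits for submitted tasks to finish.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> {
                    permits.acquire();              // waiting parks the virtual thread, not an OS thread
                    try {
                        return fetcher.apply(url);
                    } finally {
                        permits.release();
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());               // a failed fetch surfaces as ExecutionException
            }
            return results;
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in fetcher so the sketch runs without network access.
        Function<String, String> fake = url -> "fetched " + url;
        List<String> pages = fetchAll(
                List.of("https://example.com/1", "https://example.com/2"), fake, 10);
        pages.forEach(System.out::println);
    }
}
```

The semaphore is the throttle: every URL gets its own virtual thread, but only maxInFlight of them talk to the server at any moment. A per-request delay could be added inside the fetcher if the target needs gentler pacing.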

Scraping JavaScript-rendered pages with Selenium

Jsoup only sees the HTML the server returns — it does not run JavaScript. For pages that render content client-side, drive a real browser with Selenium. Since Selenium 4.6, Selenium Manager downloads the matching driver automatically:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
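import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;
import java.util.List;

public class DynamicScraper {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);   // driver auto-managed
        try {
            driver.get("https://quotes.toscrape.com/js/");
            // Wait until JavaScript has rendered at least one quote before reading the DOM.
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(".quote")));
            List<WebElement> quotes = driver.findElements(By.cssSelector(".quote"));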
            for (WebElement q : quotes) {
                String text = q.findElement(By.cssSelector(".text")).getText();
                String author = q.findElement(By.cssSelector(".author")).getText();
                System.out.println(author + ": " + text);
            }
        } finally {
            driver.quit();
        }
    }
}

HtmlUnit is a lighter, GUI-less alternative that runs some JavaScript without a real browser, and Playwright for Java is a newer browser-automation option that is often faster than Selenium. But every browser-driving approach is heavier and more detectable than a plain HTTP fetch.

Which Java scraping library should you use?

Library	Type	Runs JS?	Best for
Jsoup	Fetch + parse	No	Static pages — the default choice
HttpClient	HTTP client (JDK)	No	APIs, custom requests, async
HtmlUnit	Headless browser	Partial	Light JS without a real browser
Selenium	Browser automation	Yes	Full JS rendering, widest docs
Playwright (Java)	Browser automation	Yes	Modern JS pages, faster than Selenium

For most static-page jobs, Jsoup alone is enough. Add a browser tool only when the data is genuinely rendered by JavaScript.

The hard part: handling anti-bot blocking

Java code is rarely the reason a scraper fails — anti-bot defenses are. Jsoup sends a TLS handshake and header set that anti-bot systems (Cloudflare, DataDome, Akamai) recognise as non-browser traffic, and headless Selenium leaks automation signals. You cannot parse a 403 or a CAPTCHA page.
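
Before parsing, it helps to check whether a response is a block page at all. The sketch below is one hedged way to do that with plain JDK code. BlockDetector, looksBlocked, and backoff are hypothetical names, and both the status-code list and the "captcha" marker are assumed heuristics, not rules published by any vendor.

```java
import java.time.Duration;
import java.util.Locale;

// Sketch: classify a response as "blocked" and compute a retry delay.
final class BlockDetector {

    /** Returns true when the response looks like a refusal rather than real content. */
    static boolean looksBlocked(int status, String body) {
        // Assumed list: statuses commonly seen when a request is refused or rate limited.
        if (status == 403 || status == 429 || status == 503) {
            return true;
        }
        // Assumed heuristic: some challenge pages return 200 with a CAPTCHA marker in the HTML.
        String lower = body == null ? "" : body.toLowerCase(Locale.ROOT);
        return lower.contains("captcha");
    }

    /** Exponential backoff: base * 2^attempt, capped at max. The first retry is attempt 0. */
    static Duration backoff(int attempt, Duration base, Duration max) {
        long factor = 1L << Math.min(Math.max(attempt, 0), 20);   // clamp the shift to avoid overflow
        Duration delay = base.multipliedBy(factor);
        return delay.compareTo(max) > 0 ? max : delay;
    }

    public static void main(String[] args) {
        System.out.println(looksBlocked(403, ""));                 // refused outright
        System.out.println(looksBlocked(200, "<h1>Books</h1>"));   // ordinary page
        System.out.println(backoff(3, Duration.ofSeconds(1), Duration.ofSeconds(30)));
    }
}
```

A scraper could call looksBlocked on each response and, when it returns true, sleep for backoff(attempt, ...) before retrying rather than feeding the block page to the parser. Retrying only helps with rate limits; a fingerprint-based block will keep failing until the request itself changes.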

Getting past these defenses means rotating residential proxies and matching a real browser's TLS fingerprint, which is a project of its own. A managed scraping API handles all of that server-side; your Java code just POSTs the target URL and parses the returned HTML with Jsoup as usual:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrapingApiScraper {
    public static void main(String[] args) throws Exception {
        String payload = """
            {"cmd": "request.get", "url": "https://example.com/protected"}
            """;

        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();

        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());

        // resp.body() holds JSON; the rendered HTML is at solution.response.
        // Parse it with Jsoup exactly as you would a normal page.
        System.out.println(resp.body());
    }
}

Frequently asked questions

What is the best library for web scraping with Java?

Jsoup is the best default — it fetches and parses HTML with CSS selectors in a single dependency and covers most static sites. Add Selenium or Playwright for Java only when the page renders its content with JavaScript. For raw HTTP control and async requests, the JDK built-in java.net.http.HttpClient pairs well with Jsoup for parsing.

Can Java scrape JavaScript-rendered websites?

Yes, but not with Jsoup alone — Jsoup does not execute JavaScript. Use Selenium (which auto-manages its browser driver since version 4.6), Playwright for Java, or HtmlUnit for lighter cases. Alternatively, find the JSON API the page calls and request it directly with HttpClient, which is faster than driving a browser.

Why does my Java scraper get blocked, and how do I fix it?

Most blocks come from anti-bot systems that recognise non-browser TLS handshakes, header sets, and request patterns. To reduce them, set a realistic User-Agent instead of the library default, throttle your request rate, and rotate residential proxies. Against serious anti-bot vendors you also need a real browser TLS fingerprint, which is hard to maintain in Java directly, so many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.

Is Java good for web scraping compared to Python?

Java is excellent for large, long-running, production scrapers thanks to its speed, strong typing, and first-class concurrency (virtual threads in Java 21). Python wins on ecosystem breadth and quick scripting. If you already run a JVM stack or need high-throughput concurrent crawling, Java is a very solid choice.

Last updated: 2026-06-08