Web Scraping With Java

Web scraping with Java means fetching a web page over HTTP and extracting structured data from its HTML, usually with Jsoup for static pages and Selenium or Playwright for JavaScript-rendered ones. Java is a strong choice for production scrapers: it is fast, strongly typed, and has first-class concurrency (virtual threads since Java 21), which matters when you are fetching thousands of pages. A common stack in 2026 is Jsoup on its own for simple jobs, or the JDK's built-in HttpClient for fetching paired with Jsoup for parsing.

Quick facts

Static parsing: Jsoup 1.22.x (fetch, parse, and CSS selectors in one library)
JavaScript pages: Selenium 4.x (auto driver) or HtmlUnit / Playwright for Java
HTTP client: java.net.http.HttpClient (built into Java 11+; virtual threads in 21+)
Concurrency: ExecutorService / virtual threads for parallel fetching
Build tool: Maven or Gradle dependency on org.jsoup:jsoup

Your first Java scraper with Jsoup

Jsoup is the workhorse of Java scraping: it fetches a page, parses the HTML into a tree, and lets you select elements with CSS selectors — all in one dependency. Add it with Maven (org.jsoup:jsoup:1.22.2) or Gradle, then:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BookScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://books.toscrape.com/")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .timeout(10_000)
                .get();

        Elements books = doc.select("article.product_pod");
        for (Element book : books) {
            String title = book.selectFirst("h3 > a").attr("title");
            String price = book.selectFirst(".price_color").text();
            System.out.println(title + " | " + price);
        }
    }
}

Jsoup.connect(url).get() does the HTTP request and returns a parsed Document. From there, select() takes any CSS selector and returns every match, while selectFirst() returns a single element, or null when nothing matches. .text() reads inner text; .attr("href") reads an attribute. Always set a realistic, current userAgent rather than relying on Jsoup's built-in default, which is a fixed string that anti-bot filters can easily match and block.
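To try the selector calls without touching the network, the sketch below parses an inline HTML string with Jsoup.parse. The class name SelectorDemo, the extract helper, and the sample markup are hypothetical; the markup only imitates the shape of a listing on books.toscrape.com. It also guards against selectFirst() returning null, which the short example above skips for brevity.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Hypothetical sample markup: one complete listing and one with no link or price.
    static final String SAMPLE_HTML = """
        <article class="product_pod">
          <h3><a href="catalogue/book_1/index.html" title="A Light in the Attic">A Light in ...</a></h3>
          <p class="price_color">51.77</p>
        </article>
        <article class="product_pod">
          <h3>Listing with no link</h3>
        </article>
        """;

    /** Returns one "title | price | absolute link" row per complete listing. */
    static List<String> extract(String html, String baseUri) {
        // The base URI lets absUrl() turn relative hrefs into absolute ones.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> rows = new ArrayList<>();
        for (Element book : doc.select("article.product_pod")) {
            Element link = book.selectFirst("h3 > a");
            Element price = book.selectFirst(".price_color");
            if (link == null || price == null) {
                continue; // selectFirst() returns null when nothing matches
            }
            rows.add(link.attr("title") + " | " + price.text() + " | " + link.absUrl("href"));
        }
        return rows;
    }

    public static void main(String[] args) {
        extract(SAMPLE_HTML, "https://books.toscrape.com/").forEach(System.out::println);
    }
}
```

The second listing is skipped rather than crashing the run, which is usually the behaviour you want when a site's markup is inconsistent.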

Following pagination and crawling

Most real jobs span many pages. Jsoup makes it easy to read the "next" link and follow it. Resolve relative URLs with absUrl() so links work no matter how they are written in the HTML:

String url = "https://books.toscrape.com/";
while (url != null) {
    Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();

    for (Element book : doc.select("article.product_pod")) {
        System.out.println(book.selectFirst("h3 > a").attr("title"));
    }

    Element next = doc.selectFirst("li.next > a");
    url = (next != null) ? next.absUrl("href") : null;   // null stops the loop
}
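To see what resolving a relative link involves, here is a small sketch that uses the JDK's java.net.URI instead of Jsoup. It is an illustration of the idea behind absUrl(), not Jsoup's implementation, and the LinkResolver class and resolve helper are hypothetical names.

```java
import java.net.URI;

public class LinkResolver {
    /** Resolves an href as written in the HTML against the URL of the page it came from. */
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://books.toscrape.com/catalogue/page-2.html";
        System.out.println(resolve(page, "page-3.html"));               // sibling page
        System.out.println(resolve(page, "../index.html"));             // parent directory
        System.out.println(resolve(page, "/static/main.css"));          // root-relative path
        System.out.println(resolve(page, "https://example.com/other")); // already absolute
    }
}
```

One caveat of this JDK-only version: URI.create throws IllegalArgumentException on hrefs that contain illegal characters such as unescaped spaces, so real-world links may need cleaning first.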

For large crawls, fetch pages in parallel. Java 21 virtual threads make this cheap: a single Executors.newVirtualThreadPerTaskExecutor() can run thousands of concurrent blocking fetches without tying up an OS thread for each one. Cap the number of requests in flight and pace them so you do not hammer the target.
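A minimal sketch of that pattern, assuming Java 21 or later: each URL gets its own virtual thread, and a Semaphore caps how many fetches run at once. The names ParallelFetcher, fetchAll, and maxInFlight are hypothetical, and the fetcher is passed in as a function so the example runs offline with a stand-in; in a real crawl it would wrap a Jsoup or HttpClient call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

public class ParallelFetcher {
    /**
     * Runs fetcher for every URL on its own virtual thread, allowing at most
     * maxInFlight calls at a time. Results are returned in input order.
     */
    static List<String> fetchAll(List<String> urls, int maxInFlight,
                                 Function<String, String> fetcher) {
        Semaphore permits = new Semaphore(maxInFlight);
        // try-with-resources waits for all submitted tasks before closing the executor.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                pending.add(pool.submit(() -> {
                    permits.acquire();           // blocks cheaply on a virtual thread
                    try {
                        return fetcher.apply(url);
                    } finally {
                        permits.release();
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "https://books.toscrape.com/catalogue/page-1.html",
            "https://books.toscrape.com/catalogue/page-2.html");
        // Stand-in fetcher so the sketch runs offline; swap in a real HTTP call.
        fetchAll(urls, 2, url -> "fetched " + url).forEach(System.out::println);
    }
}
```

The semaphore limits concurrency only. For a polite crawl you would typically also add a delay between requests to the same host.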

Scraping JavaScript-rendered pages with Selenium

Jsoup only sees the HTML the server returns — it does not run JavaScript. For pages that render content client-side, drive a real browser with Selenium. Since Selenium 4.6, Selenium Manager downloads the matching driver automatically:

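import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import java.time.Duration;
import java.util.List;

public class DynamicScraper {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        WebDriver driver = new ChromeDriver(options);   // driver auto-managed
        try {
            driver.get("https://quotes.toscrape.com/js/");
            // Wait until JavaScript has rendered at least one quote before reading the DOM.
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(".quote")));
            List<WebElement> quotes = driver.findElements(By.cssSelector(".quote"));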
            for (WebElement q : quotes) {
                String text = q.findElement(By.cssSelector(".text")).getText();
                String author = q.findElement(By.cssSelector(".author")).getText();
                System.out.println(author + ": " + text);
            }
        } finally {
            driver.quit();
        }
    }
}

HtmlUnit is a lighter, GUI-less alternative that runs some JavaScript without launching a real browser, and Playwright for Java is a newer browser-automation option that is often faster than Selenium. Every browser-driving approach, however, is heavier and more detectable than a plain HTTP fetch.

Which Java scraping library should you use?

Library | Type | Runs JS? | Best for
Jsoup | Fetch + parse | No | Static pages (the default choice)
HttpClient | HTTP client (JDK) | No | APIs, custom requests, async
HtmlUnit | Headless browser | Partial | Light JS without a real browser
Selenium | Browser automation | Yes | Full JS rendering, widest docs
Playwright (Java) | Browser automation | Yes | Modern JS pages, faster than Selenium

For 90% of jobs, Jsoup alone is enough. Add a browser tool only when the data is genuinely rendered by JavaScript.
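For the HttpClient row, a minimal sketch of a custom GET request using only the JDK. The class name JsonApiClient, the helper names, and the header values are illustrative assumptions; the URL is supplied on the command line, so nothing is fetched unless you pass one.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class JsonApiClient {
    /** Builds a GET request with a browser-like User-Agent and a per-request timeout. */
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
            .uri(URI.create(url))
            .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
            .header("Accept", "application/json")
            .timeout(Duration.ofSeconds(10))
            .GET()
            .build();
    }

    /** Sends the request and returns the body, failing loudly on any non-200 status. */
    static String fetch(HttpClient client, String url) throws Exception {
        HttpResponse<String> resp =
            client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() != 200) {
            throw new IllegalStateException(
                "Unexpected status " + resp.statusCode() + " for " + url);
        }
        return resp.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java JsonApiClient <url>");
            return;
        }
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();
        System.out.println(fetch(client, args[0]));
    }
}
```

Reusing one HttpClient across requests lets it pool connections. If the response is HTML rather than JSON, the returned string can be handed to Jsoup.parse for extraction.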

The hard part: handling anti-bot blocking

Java code is rarely the reason a scraper fails; anti-bot defenses are. Jsoup sends a TLS handshake and header set that anti-bot systems (Cloudflare, DataDome, Akamai) recognise as non-browser traffic, and headless Selenium leaks automation signals. A 403 response or a CAPTCHA page contains none of the data you came for, so there is nothing useful to parse.
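Because a block page can still arrive as ordinary HTML, it helps to check each response before parsing it. The sketch below is a rough heuristic, not a vendor-specific detector: the class name BlockDetector, the list of status codes, and the marker phrases are all assumptions to adapt per target.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class BlockDetector {
    // Status codes assumed to signal a refusal or rate limit; adjust per target.
    private static final Set<Integer> BLOCK_STATUSES = Set.of(403, 429, 503);

    // Hypothetical marker phrases; real challenge pages differ by vendor.
    private static final List<String> BODY_MARKERS =
            List.of("captcha", "access denied", "verify you are human");

    /** Returns true when the status or body suggests a block or challenge page. */
    static boolean looksBlocked(int status, String body) {
        if (BLOCK_STATUSES.contains(status)) {
            return true;
        }
        String lower = (body == null) ? "" : body.toLowerCase(Locale.ROOT);
        return BODY_MARKERS.stream().anyMatch(lower::contains);
    }

    public static void main(String[] args) {
        System.out.println(looksBlocked(403, "<html>Forbidden</html>"));                    // true
        System.out.println(looksBlocked(200, "<html>Please complete the CAPTCHA</html>"));  // true
        System.out.println(looksBlocked(200, "<article class=\"product_pod\"></article>")); // false
    }
}
```

Logging or retrying on a true result keeps empty or misleading rows out of your output. A keyword check like this can misfire on pages that legitimately mention those phrases.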

Handling this yourself means rotating residential proxies and matching a real browser's TLS fingerprint, which is a project of its own. A managed scraping API handles all of that server-side; your Java code just POSTs the target URL and parses the returned HTML with Jsoup as usual:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScrapingApiScraper {
    public static void main(String[] args) throws Exception {
        String payload = """
            {"cmd": "request.get", "url": "https://example.com/protected"}
            """;

        HttpRequest req = HttpRequest.newBuilder()
            .uri(URI.create("https://api.your-scraping-provider.com/v1?key=YOUR_API_KEY"))
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(payload))
            .build();

        HttpResponse<String> resp = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());

        // resp.body() holds JSON; with this payload format the rendered HTML
        // is at solution.response (field names vary, so check your provider's docs).
        // Parse it with Jsoup exactly as you would a normal page.
        System.out.println(resp.body());
    }
}

Frequently asked questions

What is the best library for web scraping with Java?

Jsoup is the best default — it fetches and parses HTML with CSS selectors in a single dependency and covers most static sites. Add Selenium or Playwright for Java only when the page renders its content with JavaScript. For raw HTTP control and async requests, the JDK built-in java.net.http.HttpClient pairs well with Jsoup for parsing.

Can Java scrape JavaScript-rendered websites?

Yes, but not with Jsoup alone — Jsoup does not execute JavaScript. Use Selenium (which auto-manages its browser driver since version 4.6), Playwright for Java, or HtmlUnit for lighter cases. Alternatively, find the JSON API the page calls and request it directly with HttpClient, which is faster than driving a browser.

Why does my Java scraper get blocked, and how do I fix it?

Scrapers are usually blocked because anti-bot systems recognise their headers, TLS handshake, or request rate as non-browser traffic. Set a realistic User-Agent rather than relying on Jsoup's default, throttle your request rate, and rotate residential proxies. Against serious anti-bot vendors you also need a real browser TLS fingerprint, which is hard to maintain in Java directly; many teams route hard targets through a scraping API that handles proxies, fingerprinting, and challenges server-side.

Is Java good for web scraping compared to Python?

Java is excellent for large, long-running, production scrapers thanks to its speed, strong typing, and first-class concurrency (virtual threads in Java 21). Python wins on ecosystem breadth and quick scripting. If you already run a JVM stack or need high-throughput concurrent crawling, Java is a very solid choice.

Last updated: 2026-06-08