Your first Java scraper with Jsoup
Jsoup is the workhorse of Java scraping: it fetches a page, parses the HTML into a tree, and lets you select elements with CSS selectors — all in one dependency. Add it with Maven (org.jsoup:jsoup:1.22.2) or Gradle, then:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class BookScraper {
        public static void main(String[] args) throws Exception {
            // Fetch the page and parse it into a Document in one call
            Document doc = Jsoup.connect("https://books.toscrape.com/")
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                    .timeout(10_000) // milliseconds
                    .get();

            // Each book on the page is an <article class="product_pod">
            Elements books = doc.select("article.product_pod");
            for (Element book : books) {
                String title = book.selectFirst("h3 > a").attr("title");
                String price = book.selectFirst(".price_color").text();
                System.out.println(title + " | " + price);
            }
        }
    }

Jsoup.connect(url).get() does the HTTP request and returns a parsed Document. From there, select() takes any CSS selector and returns every match, while selectFirst() returns the first match, or null if nothing matches. .text() reads inner text; .attr("href") reads an attribute. Always set a realistic, current userAgent rather than relying on the library default: recent Jsoup versions send a fixed desktop-browser string that does not track browser releases, and older versions fell back to the JVM's Java/<version> identifier, which is trivially blocked.
Following pagination and crawling
Most real jobs span many pages. Jsoup makes it easy to read the "next" link and follow it. Resolve relative URLs with absUrl() so links work no matter how they are written in the HTML:

    String url = "https://books.toscrape.com/";
    while (url != null) {
        Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
        for (Element book : doc.select("article.product_pod")) {
            System.out.println(book.selectFirst("h3 > a").attr("title"));
        }
        // The last page has no "next" link, so selectFirst returns null
        Element next = doc.selectFirst("li.next > a");
        url = (next != null) ? next.absUrl("href") : null; // null stops the loop
    }

For large crawls, fetch pages in parallel. Java 21 virtual threads make this cheap: one Executors.newVirtualThreadPerTaskExecutor() can run thousands of concurrent fetches without exhausting OS threads. Throttle politely so you do not hammer the target.
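The parallel-fetch idea can be sketched with JDK classes only. This is a minimal sketch, assuming Java 21 or later; `ParallelFetcher`, `fetchAll`, and the `maxConcurrent` cap are hypothetical names introduced for illustration, and the `fetcher` function stands in for whatever performs the request. A `Semaphore` is one simple way to throttle, not the only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical helper: one virtual thread per URL, capped by a semaphore.
final class ParallelFetcher {

    static <T> List<T> fetchAll(List<String> urls, int maxConcurrent, Function<String, T> fetcher) {
        // Throttle: at most maxConcurrent fetches are in flight at once
        Semaphore permits = new Semaphore(maxConcurrent);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<T>> futures = new ArrayList<>();
            for (String url : urls) {
                futures.add(executor.submit(() -> {
                    permits.acquire();
                    try {
                        return fetcher.apply(url);
                    } finally {
                        permits.release();
                    }
                }));
            }
            // Collect results in the same order as the input list
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A fetch failed", e.getCause());
        }
    }
}
```

In a real crawl the `fetcher` argument would wrap a call such as `Jsoup.connect(url).get()` and rethrow its `IOException` as an unchecked exception, since `Function` cannot throw checked exceptions.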
Scraping JavaScript-rendered pages with Selenium
Jsoup only sees the HTML the server returns — it does not run JavaScript. For pages that render content client-side, drive a real browser with Selenium. Since Selenium 4.6, Selenium Manager downloads the matching driver automatically:

    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.openqa.selenium.chrome.ChromeOptions;
    import java.util.List;

    public class DynamicScraper {
        public static void main(String[] args) {
            ChromeOptions options = new ChromeOptions();
            options.addArguments("--headless=new"); // run Chrome without a window
            WebDriver driver = new ChromeDriver(options); // driver auto-managed
            try {
                driver.get("https://quotes.toscrape.com/js/");
                List<WebElement> quotes = driver.findElements(By.cssSelector(".quote"));
                for (WebElement q : quotes) {
                    String text = q.findElement(By.cssSelector(".text")).getText();
                    String author = q.findElement(By.cssSelector(".author")).getText();
                    System.out.println(author + ": " + text);
                }
            } finally {
                driver.quit(); // always release the browser process
            }
        }
    }

HtmlUnit is a lighter, GUI-less alternative that runs some JavaScript without a real browser, and Playwright for Java is the modern, faster option. But every browser-driving approach is heavier and more detectable than a plain HTTP fetch.
Which Java scraping library should you use?
| Library | Type | Runs JS? | Best for |
|---|---|---|---|
| Jsoup | Fetch + parse | No | Static pages — the default choice |
| HttpClient | HTTP client (JDK) | No | APIs, custom requests, async |
| HtmlUnit | Headless browser | Partial | Light JS without a real browser |
| Selenium | Browser automation | Yes | Full JS rendering, widest docs |
| Playwright (Java) | Browser automation | Yes | Modern JS pages, faster than Selenium |
For most jobs, Jsoup alone is enough. Add a browser tool only when the data is genuinely rendered by JavaScript.
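For the HttpClient row, here is a minimal sketch of a custom request using the JDK's `java.net.http` package (Java 11 or later). The class name `ApiFetcher`, the header values, and the 10-second timeouts are assumptions chosen for illustration. `sendAsync` returns a `CompletableFuture`, which is what makes the client a fit for async work.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;

// Hypothetical helper around the JDK HTTP client.
final class ApiFetcher {

    // One shared client; it is thread-safe and reuses connections
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(10))
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    // Builds a GET request with custom headers (values are illustrative)
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
                .header("Accept", "application/json")
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
    }

    // Sends the request without blocking and yields the response body
    static CompletableFuture<String> fetchBodyAsync(String url) {
        return CLIENT.sendAsync(buildRequest(url), HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
    }
}
```

A caller might write `ApiFetcher.fetchBodyAsync(url).thenAccept(System.out::println).join();`. If the body is HTML rather than JSON, it can still be handed to `Jsoup.parse(body)` for selection.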
The hard part: handling anti-bot blocking
Java code is rarely the reason a scraper fails — anti-bot defenses are. Jsoup sends a TLS handshake and header set that anti-bot systems (Cloudflare, DataDome, Akamai) recognise as non-browser traffic, and headless Selenium leaks automation signals. You cannot parse a 403 or a CAPTCHA page.
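Because a block page parses without error, it helps to fail fast before extracting data. The following is a minimal sketch under stated assumptions: the status codes and body markers are heuristics chosen for illustration, not a complete or authoritative list, and `BlockDetector` is a hypothetical name.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical heuristic check for "this response is a block page".
final class BlockDetector {

    // Assumed status codes that often signal blocking or rate limiting
    private static final List<Integer> BLOCK_STATUSES = List.of(403, 429, 503);

    // Assumed body markers; real challenge pages vary by vendor
    private static final List<String> BODY_MARKERS = List.of("captcha", "access denied");

    static boolean looksBlocked(int statusCode, String body) {
        if (BLOCK_STATUSES.contains(statusCode)) {
            return true;
        }
        // Case-insensitive substring match against the known markers
        String lower = (body == null) ? "" : body.toLowerCase(Locale.ROOT);
        return BODY_MARKERS.stream().anyMatch(lower::contains);
    }
}
```

With Jsoup, the inputs are available from `Jsoup.connect(url).ignoreHttpErrors(true).execute()`, whose response exposes `statusCode()` and `body()`. A substring check like this can misfire on a legitimate page that merely mentions one of the markers, so treat a positive result as a reason to inspect the response rather than as proof.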
Getting past these defenses means rotating residential proxies and matching a real browser's TLS fingerprint, which is a project of its own. A managed scraping API handles all of that server-side; your Java code just POSTs the target URL and parses the returned HTML with Jsoup as usual: