
Web Scraping with Java: Step-by-Step Tutorial for 2025

published 24 days ago
by Robert Wilson

Key Takeaways

  • Java offers multiple robust libraries for web scraping, with JSoup ideal for static sites and Selenium/Playwright for dynamic content
  • Modern Java features like virtual threads (Project Loom) can significantly improve scraping performance
  • Implementing proper error handling and rate limiting is crucial for reliable scraping
  • Using headless browsers and proxy rotation helps avoid blocking
  • Building modular and maintainable scrapers requires proper architecture and design patterns

Introduction

Web scraping has become an essential skill for developers in recent years, enabling data collection for market research, price monitoring, lead generation, and more. While Python often gets the spotlight for web scraping, Java offers robust capabilities that make it an excellent choice, especially for enterprise applications. Learn more about web scraping use cases and applications.

This guide will walk you through building production-ready web scrapers in Java, from basic concepts to advanced techniques. We'll cover both theory and practice, with real-world examples and code you can start using today.

Setting Up Your Environment

Prerequisites

  • Java 21 LTS or newer (we'll use virtual threads for improved performance)
  • Maven or Gradle for dependency management
  • An IDE (IntelliJ IDEA, Eclipse, or VS Code)

Essential Libraries

Add these dependencies to your project:

For Maven (pom.xml):

<dependencies>
    <!-- JSoup for HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version>
    </dependency>
    
    <!-- Selenium for dynamic content -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.18.1</version>
    </dependency>
</dependencies>

For Gradle (build.gradle):

dependencies {
    implementation 'org.jsoup:jsoup:1.17.2'
    implementation 'org.seleniumhq.selenium:selenium-java:4.18.1'
}

Basic Web Scraping with JSoup

Creating Your First Scraper

Let's start with a simple example that scrapes product information from an e-commerce site:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class BasicScraper {
    public static void main(String[] args) {
        try {
            // Configure request with headers to avoid blocking
            Document doc = Jsoup.connect("https://example.com/products")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0")
                .header("Accept", "text/html")
                .header("Accept-Language", "en-US")
                .get();
                
            // Select all product elements
            Elements products = doc.select(".product-item");
            
            for (Element product : products) {
                String name = product.select(".product-name").text();
                String price = product.select(".product-price").text();
                String imageUrl = product.select("img").attr("src");
                
                System.out.printf("Product: %s, Price: %s, Image: %s%n", 
                    name, price, imageUrl);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Best Practices for Basic Scraping

  • Always set a user agent and appropriate headers
  • Implement error handling and retries
  • Use CSS selectors or JSoup's DOM navigation methods
  • Add delays between requests to avoid overloading servers
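To make the retry point above concrete, here is a small generic retry helper with exponential backoff. This is a minimal sketch of our own (the `RetryUtil` name and the defaults are not from any library):

```java
import java.util.concurrent.Callable;

public class RetryUtil {
    // Runs the task, retrying up to maxAttempts times and doubling the
    // delay between attempts (simple exponential backoff).
    public static <T> T withRetries(Callable<T> task, int maxAttempts,
                                    long initialDelayMs) throws Exception {
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // back off before the next attempt
                }
            }
        }
        throw last; // all attempts failed
    }
}
```

A JSoup call can then be wrapped as `RetryUtil.withRetries(() -> Jsoup.connect(url).get(), 3, 500)`.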

Advanced Scraping Techniques

Handling Dynamic Content with Selenium

Many modern websites load content dynamically through JavaScript. Here's how to handle this with browser automation; for help choosing a tool, see our guide comparing Playwright and Selenium:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;
import java.util.List;

public class DynamicScraper {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        // WebDriver is not AutoCloseable, so quit() in a finally block
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/dynamic-content");

            // Wait for dynamic content to load
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
            wait.until(ExpectedConditions.presenceOfElementLocated(
                By.cssSelector(".dynamic-content")));

            // Extract data
            List<WebElement> elements = driver.findElements(
                By.cssSelector(".dynamic-content"));

            for (WebElement element : elements) {
                System.out.println(element.getText());
            }
        } finally {
            driver.quit();
        }
    }
}

Using Virtual Threads for Parallel Scraping

Java 21's virtual threads can significantly improve scraping performance:

import org.jsoup.Jsoup;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelScraper {
    public static void main(String[] args)
            throws InterruptedException, ExecutionException {
        List<String> urls = List.of(
            "https://example.com/page1",
            "https://example.com/page2",
            "https://example.com/page3"
        );

        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = urls.stream()
                .map(url -> executor.submit(() -> scrapeUrl(url)))
                .toList();

            for (Future<String> future : futures) {
                System.out.println(future.get());
            }
        }
    }

    private static String scrapeUrl(String url) {
        try {
            return Jsoup.connect(url).get().title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Avoiding Blocks and Rate Limiting

Implementing Rate Limiting

Here's a simple rate limiter implementation to help you stay under a site's request limits and avoid rate-limiting errors:

import java.util.LinkedList;
import java.util.Queue;

public class RateLimiter {
    private final int requestsPerSecond;
    private final Queue<Long> requestTimestamps = new LinkedList<>();

    public RateLimiter(int requestsPerSecond) {
        this.requestsPerSecond = requestsPerSecond;
    }

    public synchronized void acquire() throws InterruptedException {
        while (true) {
            long now = System.currentTimeMillis();

            // Remove timestamps older than 1 second
            while (!requestTimestamps.isEmpty() &&
                   now - requestTimestamps.peek() > 1000) {
                requestTimestamps.poll();
            }

            // If under the limit, record this request and proceed
            if (requestTimestamps.size() < requestsPerSecond) {
                requestTimestamps.add(now);
                return;
            }

            // Otherwise wait until the oldest request leaves the 1s window
            Thread.sleep(1000 - (now - requestTimestamps.peek()) + 1);
        }
    }
}

Building a Production-Ready Scraper

Architecture Best Practices

  • Separate concerns: parsing, data storage, and request handling
  • Implement proper logging and monitoring
  • Use configuration files for easily modifiable settings
  • Include comprehensive error handling and recovery
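The configuration file referenced in the layout below might look something like this. All keys here are illustrative, not a fixed schema; adapt them to whatever config loader you use (Spring Boot, SnakeYAML, etc.):

```yaml
# application.yml -- illustrative keys, adapt to your own config loader
scraper:
  base-url: https://example.com
  requests-per-second: 2
  retry:
    max-attempts: 3
    initial-delay-ms: 500
  user-agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0"

proxy:
  rotation-enabled: true
  endpoints:
    - host: proxy1.example.com
      port: 8080
```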

Sample Production Architecture

src/
  ├── main/
  │   ├── java/
  │   │   ├── config/
  │   │   │   ├── ScraperConfig.java
  │   │   │   └── ProxyConfig.java
  │   │   ├── model/
  │   │   │   └── ScrapedData.java
  │   │   ├── service/
  │   │   │   ├── ScraperService.java
  │   │   │   └── StorageService.java
  │   │   ├── util/
  │   │   │   ├── RateLimiter.java
  │   │   │   └── ProxyRotator.java
  │   │   └── Application.java
  │   └── resources/
  │       └── application.yml
  └── test/
      └── java/
          └── service/
              └── ScraperServiceTest.java

Common Challenges and Solutions

Challenge: JavaScript Rendering

Solution: Use Selenium or Playwright for full browser automation:

import com.microsoft.playwright.Browser;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;

// Using Playwright for modern browser automation
try (Playwright playwright = Playwright.create()) {
    Browser browser = playwright.chromium().launch();
    Page page = browser.newPage();
    page.navigate("https://example.com");
    
    // Wait for JavaScript content
    page.waitForSelector(".dynamic-content");
    
    // Extract data
    String content = page.innerHTML(".dynamic-content");
    System.out.println(content);
}

Challenge: CAPTCHAs and IP Blocks

Solution: Implement proxy rotation and sophisticated request patterns:

import java.net.Proxy;
import java.util.List;

public class ProxyRotator {
    private final List<Proxy> proxies;
    private int currentIndex = 0;

    public ProxyRotator(List<Proxy> proxies) {
        this.proxies = proxies;
    }

    // Returns proxies in round-robin order
    public synchronized Proxy getNext() {
        Proxy proxy = proxies.get(currentIndex);
        currentIndex = (currentIndex + 1) % proxies.size();
        return proxy;
    }
}
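Wiring a rotated proxy into requests is straightforward, since JSoup accepts a `java.net.Proxy` directly on the connection. A minimal sketch (the proxy host and port below are placeholders, not real endpoints):

```java
import java.net.InetSocketAddress;
import java.net.Proxy;

public class ProxyExample {
    public static void main(String[] args) {
        // Hypothetical proxy endpoint -- substitute your provider's host/port
        Proxy proxy = new Proxy(Proxy.Type.HTTP,
                new InetSocketAddress("proxy1.example.com", 8080));

        // JSoup takes the proxy on the connection:
        // Document doc = Jsoup.connect("https://example.com")
        //         .proxy(proxy)
        //         .get();

        System.out.println(proxy.type() + " " + proxy.address());
    }
}
```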

Developer Experiences & Community Insights

Technical discussions across various platforms reveal interesting perspectives on Java web scraping, particularly when compared to more commonly used languages like Python and Node.js. Many senior developers emphasize that while Python might offer faster initial development for simple scraping tasks, Java's strengths become apparent in larger, enterprise-scale projects.

Engineers with hands-on experience highlight several key advantages of Java for web scraping. The language's strong typing and robust ecosystem make it particularly valuable for maintaining and expanding scraping projects over time. Several developers mention successfully using Java with modern tools like Playwright and Selenium for handling JavaScript-heavy sites, challenging the notion that Java is only suitable for basic HTML parsing.

Practical insights from the development community suggest that the choice of Java for web scraping often aligns with existing team expertise and infrastructure. Teams maintaining Spring Boot applications, for instance, report successfully integrating scraping functionality directly into their applications. Some developers have built sophisticated systems combining Java scraping with React frontends for monitoring e-commerce prices or archiving web content.

The primary debate centers around development speed versus maintainability. While many acknowledge that Python offers a quicker path to building simple scrapers, Java advocates argue that the investment in proper architecture and type safety pays off in larger projects. Several engineers specifically mention JSoup for static content and Playwright for dynamic sites as their preferred tools, noting that Java's "boring but reliable" nature becomes an advantage in production environments.

Future of Web Scraping

The landscape of web scraping is evolving rapidly. Key trends to watch in 2025 and beyond:

  • AI-powered content extraction and classification
  • Increased use of headless browsers for JavaScript-heavy sites
  • Advanced anti-bot measures requiring more sophisticated scraping techniques
  • Integration with big data pipelines and real-time processing

Conclusion

Web scraping with Java offers robust solutions for data collection needs. By following the best practices and patterns outlined in this guide, you can build reliable, scalable scrapers that handle modern web challenges. Remember to always respect websites' terms of service and implement appropriate rate limiting and error handling.

