
Web Scraping with Java: Step-by-Step Tutorial for 2025

published a month ago
by Robert Wilson

Key Takeaways

  • Java offers multiple robust libraries for web scraping, with JSoup ideal for static sites and Selenium/Playwright for dynamic content
  • Modern Java features like virtual threads (Project Loom) can significantly improve scraping performance
  • Implementing proper error handling and rate limiting is crucial for reliable scraping
  • Using headless browsers and proxy rotation helps avoid blocking
  • Building modular and maintainable scrapers requires proper architecture and design patterns

Introduction

Web scraping has become an essential skill for developers in recent years, enabling data collection for market research, price monitoring, lead generation, and more. While Python often gets the spotlight for web scraping, Java offers robust capabilities that make it an excellent choice, especially for enterprise applications. Learn more about web scraping use cases and applications.

This guide will walk you through building production-ready web scrapers in Java, from basic concepts to advanced techniques. We'll cover both theory and practice, with real-world examples and code you can start using today.

Setting Up Your Environment

Prerequisites

  • Java 21 LTS or newer (we'll use virtual threads for improved performance)
  • Maven or Gradle for dependency management
  • An IDE (IntelliJ IDEA, Eclipse, or VS Code)

Essential Libraries

Add these dependencies to your project:

For Maven (pom.xml):

<dependencies>
    <!-- JSoup for HTML parsing -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.17.2</version>
    </dependency>
    
    <!-- Selenium for dynamic content -->
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.18.1</version>
    </dependency>
</dependencies>

For Gradle (build.gradle):

dependencies {
    implementation 'org.jsoup:jsoup:1.17.2'
    implementation 'org.seleniumhq.selenium:selenium-java:4.18.1'
}

Basic Web Scraping with JSoup

Creating Your First Scraper

Let's start with a simple example that scrapes product information from an e-commerce site:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class BasicScraper {
    public static void main(String[] args) {
        try {
            // Configure request with headers to avoid blocking
            Document doc = Jsoup.connect("https://example.com/products")
                .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0")
                .header("Accept", "text/html")
                .header("Accept-Language", "en-US")
                .get();
                
            // Select all product elements
            Elements products = doc.select(".product-item");
            
            for (Element product : products) {
                String name = product.select(".product-name").text();
                String price = product.select(".product-price").text();
                String imageUrl = product.select("img").attr("src");
                
                System.out.printf("Product: %s, Price: %s, Image: %s%n", 
                    name, price, imageUrl);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Best Practices for Basic Scraping

  • Always set a user agent and appropriate headers
  • Implement error handling and retries (see the sketch after this list)
  • Use CSS selectors or JSoup's DOM navigation methods
  • Add delays between requests to avoid overloading servers
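
To make the retry point concrete, here is a minimal sketch of a fetch helper that retries failed requests with a simple backoff. The attempt count, delays, and timeout are placeholder values you should tune for your targets:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RetryingFetcher {
    // Fetches a URL, retrying up to maxRetries times (assumed >= 1)
    // with a linearly growing delay between attempts
    public static Document fetchWithRetries(String url, int maxRetries)
            throws IOException, InterruptedException {
        IOException lastError = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                return Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0.0.0")
                    .timeout(10_000) // fail fast instead of hanging indefinitely
                    .get();
            } catch (IOException e) {
                lastError = e;
                Thread.sleep(1000L * attempt); // back off before the next attempt
            }
        }
        throw lastError;
    }
}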

Advanced Scraping Techniques

Handling Dynamic Content with Selenium

Many modern websites load content dynamically through JavaScript. Here's how to handle this with Selenium (for help picking a tool, see our guide on choosing between Playwright and Selenium):

import java.time.Duration;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class DynamicScraper {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/dynamic-content");

            // Wait for dynamic content to load
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));
            wait.until(ExpectedConditions.presenceOfElementLocated(
                By.cssSelector(".dynamic-content")));

            // Extract data
            List<WebElement> elements = driver.findElements(
                By.cssSelector(".dynamic-content"));

            for (WebElement element : elements) {
                System.out.println(element.getText());
            }
        } finally {
            driver.quit(); // WebDriver is not AutoCloseable, so shut it down explicitly
        }
    }
}

Using Virtual Threads for Parallel Scraping

Java 21's virtual threads can significantly improve scraping performance:

import java.io.IOException;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.jsoup.Jsoup;

public class ParallelScraper {
    public static void main(String[] args)
            throws InterruptedException, ExecutionException {
        List<String> urls = List.of(
            "https://example.com/page1",
            "https://example.com/page2",
            "https://example.com/page3"
        );

        // One virtual thread per URL; the executor waits for all tasks on close
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = urls.stream()
                .map(url -> executor.submit(() -> scrapeUrl(url)))
                .toList();

            for (Future<String> future : futures) {
                System.out.println(future.get());
            }
        }
    }

    private static String scrapeUrl(String url) throws IOException {
        // Fetch the page and return its <title> text
        return Jsoup.connect(url).get().title();
    }
}

Avoiding Blocks and Rate Limiting

Implementing Rate Limiting

Here's a simple rate limiter that caps request throughput and helps you avoid server-side rate-limit errors:

import java.util.LinkedList;
import java.util.Queue;

public class RateLimiter {
    private final int requestsPerSecond;
    private final Queue<Long> requestTimestamps = new LinkedList<>();

    public RateLimiter(int requestsPerSecond) {
        this.requestsPerSecond = requestsPerSecond;
    }

    public synchronized void acquire() throws InterruptedException {
        long now = System.currentTimeMillis();

        // Remove timestamps older than 1 second
        while (!requestTimestamps.isEmpty() &&
               now - requestTimestamps.peek() > 1000) {
            requestTimestamps.poll();
        }

        // If at the rate limit, wait until the oldest request ages out
        if (requestTimestamps.size() >= requestsPerSecond) {
            long waitTime = 1000 - (now - requestTimestamps.peek());
            if (waitTime > 0) {
                Thread.sleep(waitTime);
            }
            requestTimestamps.poll();
            now = System.currentTimeMillis();
        }

        requestTimestamps.add(now);
    }
}
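
Because acquire() is synchronized, a single RateLimiter instance can be shared across all scraping threads. A minimal usage sketch, assuming the JSoup imports from the earlier examples; the limit of five requests per second is a placeholder:

RateLimiter limiter = new RateLimiter(5); // shared across all workers

for (String url : urls) {
    limiter.acquire();                    // blocks until a request slot is free
    Document doc = Jsoup.connect(url).get();
    // ... parse doc as in the earlier examples ...
}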

Building a Production-Ready Scraper

Architecture Best Practices

  • Separate concerns: parsing, data storage, and request handling
  • Implement proper logging and monitoring
  • Use configuration files for easily modifiable settings (see the sketch after this list)
  • Include comprehensive error handling and recovery
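
To illustrate the configuration point above, here is a minimal sketch that reads settings from a plain Java properties file. The resource name and keys (scraper.properties, scraper.userAgent, scraper.requestsPerSecond) are hypothetical; a YAML file like the application.yml in the layout below would need an extra library such as SnakeYAML, whereas .properties works with the JDK alone:

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class ScraperConfig {
    private final Properties props = new Properties();

    // Loads settings from a properties file on the classpath
    public ScraperConfig(String resourceName) throws IOException {
        try (InputStream in = ScraperConfig.class
                .getClassLoader().getResourceAsStream(resourceName)) {
            if (in == null) {
                throw new IOException("Config not found: " + resourceName);
            }
            props.load(in);
        }
    }

    public String userAgent() {
        return props.getProperty("scraper.userAgent", "Mozilla/5.0");
    }

    public int requestsPerSecond() {
        return Integer.parseInt(props.getProperty("scraper.requestsPerSecond", "5"));
    }
}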

Sample Production Architecture

src/
  ├── main/
  │   ├── java/
  │   │   ├── config/
  │   │   │   ├── ScraperConfig.java
  │   │   │   └── ProxyConfig.java
  │   │   ├── model/
  │   │   │   └── ScrapedData.java
  │   │   ├── service/
  │   │   │   ├── ScraperService.java
  │   │   │   └── StorageService.java
  │   │   ├── util/
  │   │   │   ├── RateLimiter.java
  │   │   │   └── ProxyRotator.java
  │   │   └── Application.java
  │   └── resources/
  │       └── application.yml
  └── test/
      └── java/
          └── service/
              └── ScraperServiceTest.java

Common Challenges and Solutions

Challenge: JavaScript Rendering

Solution: Use Selenium or Playwright for full browser automation:

// Using Playwright for modern browser automation
// (requires com.microsoft.playwright.{Playwright, Browser, Page})
try (Playwright playwright = Playwright.create()) {
    Browser browser = playwright.chromium().launch();
    Page page = browser.newPage();
    page.navigate("https://example.com");
    
    // Wait for JavaScript content
    page.waitForSelector(".dynamic-content");
    
    // Extract data
    String content = page.innerHTML(".dynamic-content");
    System.out.println(content);
}
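
Note that Playwright is not in the dependency list from the setup section. Assuming Maven, you would add its Java binding roughly like this (check Maven Central for the current version):

<dependency>
    <groupId>com.microsoft.playwright</groupId>
    <artifactId>playwright</artifactId>
    <version>1.41.0</version>
</dependency>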

Challenge: CAPTCHAs and IP Blocks

Solution: Implement proxy rotation and sophisticated request patterns:

import java.net.Proxy;
import java.util.List;

public class ProxyRotator {
    private final List<Proxy> proxies;
    private int currentIndex = 0;

    public ProxyRotator(List<Proxy> proxies) {
        this.proxies = proxies;
    }

    // Round-robin through the configured proxies
    public synchronized Proxy getNext() {
        Proxy proxy = proxies.get(currentIndex);
        currentIndex = (currentIndex + 1) % proxies.size();
        return proxy;
    }
}
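
JSoup can route requests through a standard java.net.Proxy, so the rotator plugs in directly. A minimal usage sketch, assuming the JSoup imports from earlier plus java.net.InetSocketAddress; the proxy hosts and ports are placeholders:

ProxyRotator rotator = new ProxyRotator(List.of(
    new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy1.example.com", 8080)),
    new Proxy(Proxy.Type.HTTP, new InetSocketAddress("proxy2.example.com", 8080))
));

Document doc = Jsoup.connect("https://example.com/products")
    .proxy(rotator.getNext()) // a different proxy on each request
    .get();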

Developer Experiences & Community Insights

Technical discussions across various platforms reveal interesting perspectives on Java web scraping, particularly when compared to more commonly used languages like Python and Node.js. Many senior developers emphasize that while Python might offer faster initial development for simple scraping tasks, Java's strengths become apparent in larger, enterprise-scale projects.

Engineers with hands-on experience highlight several key advantages of Java for web scraping. The language's strong typing and robust ecosystem make it particularly valuable for maintaining and expanding scraping projects over time. Several developers mention successfully using Java with modern tools like Playwright and Selenium for handling JavaScript-heavy sites, challenging the notion that Java is only suitable for basic HTML parsing.

Practical insights from the development community suggest that the choice of Java for web scraping often aligns with existing team expertise and infrastructure. Teams maintaining Spring Boot applications, for instance, report successfully integrating scraping functionality directly into their applications. Some developers have built sophisticated systems combining Java scraping with React frontends for monitoring e-commerce prices or archiving web content.

The primary debate centers around development speed versus maintainability. While many acknowledge that Python offers a quicker path to building simple scrapers, Java advocates argue that the investment in proper architecture and type safety pays off in larger projects. Several engineers specifically mention JSoup for static content and Playwright for dynamic sites as their preferred tools, noting that Java's "boring but reliable" nature becomes an advantage in production environments.

Future of Web Scraping

The landscape of web scraping is evolving rapidly. Key trends to watch in 2025 and beyond:

  • AI-powered content extraction and classification
  • Increased use of headless browsers for JavaScript-heavy sites
  • Advanced anti-bot measures requiring more sophisticated scraping techniques
  • Integration with big data pipelines and real-time processing

Conclusion

Web scraping with Java offers robust solutions for data collection needs. By following the best practices and patterns outlined in this guide, you can build reliable, scalable scrapers that handle modern web challenges. Remember to always respect websites' terms of service and implement appropriate rate limiting and error handling.


Robert Wilson
Senior Content Manager
Robert brings 6 years of digital storytelling experience to his role as Senior Content Manager. He's crafted strategies for both Fortune 500 companies and startups. When not working, Robert enjoys hiking the PNW trails and cooking. He holds a Master's in Digital Communication from University of Washington and is passionate about mentoring new content creators.