Web crawler with virtual threads

Contents

Introduction

This article will show you a simple web crawler that uses virtual threads. The web crawler will fetch web pages and extract new URLs from them to crawl. I am using virtual threads because they are cheaper to make, and I can have more of them simultaneously. Virtual threads also make blocking very cheap; it's not a problem, for example, when a thread has to wait for a response from a webserver. If you want to learn more about virtual threads please see this post.

Building the scraper with virtual threads

The idea for this crawler is to have one virtual thread for each URL, so other threads can run when a thread is blocked while waiting for a web page. In the code below, you see the entire web crawler class. In the start method, we have a while loop that gets a URI from a deque and submits it to an ExecutorService.

Virtual threads make this crawler a bit more special than other crawlers that use the older (systems) threads. At line 9, we have a try statement with two executor services. I used one executor service for the requests the HttpClient sends and one for finding URLs in the HTTP response. The order of the executor services in the try statement is important because of ordered cancellation . We can't close the executorService that the HttpClient uses before closing the executorService that processes the response.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
public class WebCrawler {
    Pattern UrlRegex = Pattern.compile("[-a-zA-Z\\d@:%._+~#=]{1,256}\\.[a-zA-Z\\d()]{1,6}\\b([-a-zA-Z\\d()@:%_+.~#?&/=]*)");
    Set<URI> foundURIs = new HashSet<>();
    LinkedBlockingDeque<URI> deque = new LinkedBlockingDeque<>();

    public void start(URI startURI) {
        deque.add(startURI);

        try (ExecutorService httpClientExecutorService = Executors.newVirtualThreadPerTaskExecutor();
             ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {

            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.ALWAYS)
                    .connectTimeout(Duration.ofSeconds(1))
                    .executor(httpClientExecutorService)
                    .build();

            while (foundURIs.size() < 5) {
                try {
                    URI uri = deque.take();
                    System.out.println("uri = " + uri);
                    executor.submit(() -> crawl(uri, client));
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }
        }
        System.out.println("foundURIs = " + foundURIs);
    }

    private void crawl(URI uri, HttpClient client) {

        var request = HttpRequest.newBuilder()
                .uri(uri)
                .GET()
                .build();

        try {
            var response = client.send(request, HttpResponse.BodyHandlers.ofString());

            UrlRegex.matcher(response.body())
                    .results()
                    .map(m -> m.group(0))
                    .map(s -> response.uri().resolve(s))
                    .forEach(s -> {
                        if (foundURIs.add(s)) {
                            deque.add(s);
                        }
                    });
        } catch (Exception e) {
            System.out.println("Failed to parse URI: " + uri);
        }
    }
}

To start the web crawler, you only have to create an of the class instance and call the start() method with an initial URL that it can use.

1
2
WebCrawler webCrawler = new WebCrawler();
webCrawler.start(URI.create("https://www.davidvlijmincx.com/"));

Conclusion

We looked at a simple web crawler that uses virtual threads in this post. We went over how it works and where the threads are managed and created. We also saw a case where the order of executors is essential because of ordered cancellation.

References and further reading