Introduction
This article shows a simple web crawler that uses virtual threads. The crawler fetches web pages and extracts new URLs from them to crawl next. I am using virtual threads because they are cheap to create, so I can have many more of them running simultaneously. Virtual threads also make blocking very cheap; it is not a problem, for example, when a thread has to wait for a response from a web server. If you want to learn more about virtual threads, please see this post.
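To make the "blocking is cheap" point concrete, here is a small sketch (my own illustration, not from the crawler itself) that runs thousands of blocking tasks on virtual threads; the class and method names are assumptions made for this example.

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class VirtualThreadsDemo {

    // Runs n tasks that each block briefly, one virtual thread per task,
    // and returns how many of them completed.
    static int runBlockingTasks(int n) {
        AtomicInteger done = new AtomicInteger();
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < n; i++) {
                executor.submit(() -> {
                    try {
                        // Blocking a virtual thread is cheap: the carrier
                        // thread is released while we sleep.
                        Thread.sleep(Duration.ofMillis(100));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                    done.incrementAndGet();
                });
            }
        } // close() waits for all submitted tasks to finish
        return done.get();
    }
}
```

Running ten thousand of these tasks finishes in well under a second, something that would exhaust a pool of platform threads of the same size.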
Building the scraper with virtual threads
The idea for this crawler is to have one virtual thread for each URL, so other threads can run while a thread is blocked waiting for a web page. In the code below, you see the entire web crawler class. In the start method, we have a while loop that takes a URI from a deque and submits it to an ExecutorService.
Virtual threads make this crawler a bit more special than other crawlers that use the older platform threads. At line 9, we have a try statement with two executor services. I used one executor service for the requests the HttpClient sends and one for finding URLs in the HTTP response. The order of the executor services in the try statement is important because of ordered cancellation: we can't close the executor service that the HttpClient uses before closing the executor service that processes the responses.
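The original listing is not reproduced here, so the following is a minimal sketch of a crawler along the lines described above; the class name, the regex-based link extraction, and the two-second poll timeout are all my assumptions, not the author's code.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Set;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Crawler {

    // Naive href extractor; a real crawler would use an HTML parser.
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    private final BlockingDeque<URI> deque = new LinkedBlockingDeque<>();
    private final Set<URI> visited = ConcurrentHashMap.newKeySet();

    public void start(URI initial) throws InterruptedException {
        deque.add(initial);
        // Declaration order matters: try-with-resources closes resources in
        // reverse order, so responseExecutor is closed before pageExecutor,
        // the one the HttpClient uses (ordered cancellation).
        try (ExecutorService pageExecutor = Executors.newVirtualThreadPerTaskExecutor();
             ExecutorService responseExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
            HttpClient client = HttpClient.newBuilder().executor(pageExecutor).build();
            while (true) {
                URI uri = deque.poll(2, TimeUnit.SECONDS);
                if (uri == null) break;           // no new URLs for a while: stop
                if (!visited.add(uri)) continue;  // already crawled this one
                // One virtual thread per URL; blocking inside is cheap.
                responseExecutor.submit(() -> crawl(client, uri));
            }
        }
    }

    private void crawl(HttpClient client, URI uri) {
        try {
            HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
            // Blocking send: the virtual thread simply parks until the
            // web server responds.
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            Matcher m = HREF.matcher(response.body());
            while (m.find()) {
                deque.add(URI.create(m.group(1)));
            }
        } catch (Exception e) {
            // This sketch simply skips pages that fail to load.
        }
    }

    // Exposed for inspection.
    Set<URI> visited() {
        return visited;
    }
}
```

The while loop ends once the deque has been empty for the poll timeout, and the try-with-resources block then waits for any in-flight tasks before closing both executors.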
To start the web crawler, you only have to create an instance of the class and call the start() method with an initial URL that it can use.
Conclusion
In this post, we looked at a simple web crawler that uses virtual threads. We went over how it works and where the threads are created and managed. We also saw a case where the order of the executor services is essential because of ordered cancellation.