Virtual vs Platform Threads for API Calls

Contents

This post dives into virtual threads and explores if they are a silver bullet for making a high volume of API calls.

Intro

The application I am working on performs a significant number of concurrent REST calls (over 10_000) with minimal processing required on the response. This scenario seems ideal for virtual threads. Let's analyze how they perform application and if the switch is worthwhile.

Background

Virtual threads introduce an abstraction layer on top of traditional platform threads. They leverage carrier threads (essentially platform threads) but differ in how they handle blocking operations (like waiting for a response). When a virtual thread encounters a block, it unmounts from the carrier, allowing the carrier to pick up other virtual threads, and boosting hardware utilization. This flexibility comes with some scheduling overhead. The goal is to identify when virtual threads offer a clear advantage.

JEP-444 highlights the key benefits:

To put it another way, virtual threads can significantly improve application throughput when
The number of concurrent tasks is high (more than a few thousand), and
The workload is not CPU-bound, since having many more threads than processor cores cannot improve throughput in that case.

The application I am working on checks the health of service via REST requests, making it non-CPU-bound as it takes some time to get responses, and there are enough URLs to generate thousands of virtual threads. This suggests virtual threads could be a good fit.

Setup

The benchmark setup involves two machines: one to run the benchmark and another acting as a server to receive requests. The goal is to measure the time it takes to process 10,000 tasks, each involving one or two API calls. The server endpoint simulates different response times using a path variable.

The requests are made to the following end-point:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
@GetMapping("/delay/{t}")
String youChoseTheDelay(@PathVariable int t){

    try {
        Thread.sleep(t);
    } catch (InterruptedException e) {
        throw new RuntimeException(e);
    }

    return generateHtmlPageWithUrls(100, "crawl/delay/");
}

Getting a response from this end-point takes around 4ms. With the thread.sleep I can add more delay to this end-point.

Using a path variable it is possible to delay a response from the server. This makes benchmarking different response times a lot easier.

The benchmark is created with JMH. The goal is to submit 10_000 tasks to the newVirtualThreadPerTaskExecutor and newFixedThreadPool and see which one takes the least time on average to run. There are a few combinations of parameters I tested. So the number of benchmarks is:

2 types of executors (virtual and fixed pool of platform threads)
tasks with 1 or 2 API calls
0,1,2,3,4, or 5ms of extra delay.

This gives me 2 * 2 * 6 = 12 possible combinations to run. For each combination, 10_000 requests are made.

This is the benchmark code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
@Fork(value = 1, jvmArgs = {"-Xms512m", "-Xmx4096m"})
@BenchmarkMode({Mode.AverageTime})
@Warmup(time = 20)
@Measurement(iterations = 10, time = 30)
@State(Scope.Benchmark)
@Timeout(time = 60)
public class LoomBenchmark {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(LoomBenchmark.class.getSimpleName())
                .forks(1)
                .shouldDoGC(false)
                .build();

        new Runner(opt).run();
    }

    @Benchmark
    public long virtualThreadExecutor(Blackhole blackhole, ExecutionPlan plan) {
        try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
            benchmarkingMethod(blackhole, plan, executor);
        }
        return System.currentTimeMillis();
    }

    @Benchmark
    public long FixedPlatformPool(Blackhole blackhole, ExecutionPlan plan) {
        try (var executor = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors())) {
            benchmarkingMethod(blackhole, plan, executor);
        }
        return System.currentTimeMillis();
    }

    private void benchmarkingMethod(Blackhole blackhole, ExecutionPlan plan, ExecutorService executor) {
        IntStream.range(0, 10_000).forEach(i ->
                executor.submit(() -> {
                    try {
                        String response1 =  fetchURL(URI.create("http://192.168.1.17:8080/v1/crawl/delay/" + plan.delay).toURL());
                        blackhole.consume(response1);

                        if ("2".equals(plan.numberOfCalls)) {
                            String response2 =  fetchURL(URI.create("http://192.168.1.17:8080/v1/crawl/delay/" + plan.delay).toURL());
                            blackhole.consume(response2);
                        }

                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                })
        );
    }

    String fetchURL(URL url) throws IOException {
        try (var in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

}

This benchmark used the following execution plan:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
@State(Scope.Benchmark)
public class ExecutionPlan {

  @Param({"0", "1", "2", "3", "4", "5"})
  public int delay;

  @Param({"1", "2"})
  public String numberOfCalls;

}

JMH will run all possible combinations of these variables and print the results after running all the benchmarks.

Results

The benchmark is run with JDK 24-loom+1-17 (2024/6/22). This is an early access build of project Loom and has the latest changes to the virtual thread scheduler. I created two separate graphs one for the “1 API call” benchmark and one for the “2 API call” benchmark to make it easier to compare the two kinds of executors.

These are the results of submitting 10_000 tasks that each make 1 API call: As you can see the virtual threads are very stable at around 3 seconds. The platform threads are performing better with 0.8 seconds to perform 10_000 requests, in comparison to the 3 seconds the virtual threads need. As the delay increases so does the time the platform threads need to perform all those requests. The Virtual Threads on the other hand seem to perform a little better when requests have more delay.

These are the results of submitting 10_000 tasks that make 2 API calls:

As you can see the virtual threads stay very stable at around 3 seconds to perform 2 x 10_000 calls. Looking at the platform threads you see that till the 2ms of extra delay, they perform better than virtual threads, after that virtual threads are the clear winner.

For complete transparency, these are JMH benchmark results:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
Benchmark                                   (delay)  (numberOfCalls)  Mode  Cnt  Score    Error  Units
Improv.LoomBenchmark.FixedPlatformPool            0                1  avgt   10  0.808 ±  0.001   s/op
Improv.LoomBenchmark.FixedPlatformPool            0                2  avgt   10  1.614 ±  0.001   s/op
Improv.LoomBenchmark.FixedPlatformPool            1                1  avgt   10  1.239 ±  0.002   s/op
Improv.LoomBenchmark.FixedPlatformPool            1                2  avgt   10  2.473 ±  0.004   s/op
Improv.LoomBenchmark.FixedPlatformPool            2                1  avgt   10  1.461 ±  0.003   s/op
Improv.LoomBenchmark.FixedPlatformPool            2                2  avgt   10  2.918 ±  0.004   s/op
Improv.LoomBenchmark.FixedPlatformPool            3                1  avgt   10  1.730 ±  0.002   s/op
Improv.LoomBenchmark.FixedPlatformPool            3                2  avgt   10  3.461 ±  0.008   s/op
Improv.LoomBenchmark.FixedPlatformPool            4                1  avgt   10  2.017 ±  0.003   s/op
Improv.LoomBenchmark.FixedPlatformPool            4                2  avgt   10  4.032 ±  0.007   s/op
Improv.LoomBenchmark.FixedPlatformPool            5                1  avgt   10  2.317 ±  0.003   s/op
Improv.LoomBenchmark.FixedPlatformPool            5                2  avgt   10  4.633 ±  0.006   s/op

Improv.LoomBenchmark.virtualThreadExecutor        0                1  avgt   10  2.976 ±  0.262   s/op
Improv.LoomBenchmark.virtualThreadExecutor        0                2  avgt   10  2.909 ±  0.093   s/op
Improv.LoomBenchmark.virtualThreadExecutor        1                1  avgt   10  2.853 ±  0.181   s/op
Improv.LoomBenchmark.virtualThreadExecutor        1                2  avgt   10  2.913 ±  0.146   s/op
Improv.LoomBenchmark.virtualThreadExecutor        2                1  avgt   10  2.875 ±  0.254   s/op
Improv.LoomBenchmark.virtualThreadExecutor        2                2  avgt   10  2.876 ±  0.112   s/op
Improv.LoomBenchmark.virtualThreadExecutor        3                1  avgt   10  2.772 ±  0.126   s/op
Improv.LoomBenchmark.virtualThreadExecutor        3                2  avgt   10  2.856 ±  0.196   s/op
Improv.LoomBenchmark.virtualThreadExecutor        4                1  avgt   10  2.731 ±  0.166   s/op
Improv.LoomBenchmark.virtualThreadExecutor        4                2  avgt   10  2.888 ±  0.155   s/op
Improv.LoomBenchmark.virtualThreadExecutor        5                1  avgt   10  2.855 ±  0.213   s/op
Improv.LoomBenchmark.virtualThreadExecutor        5                2  avgt   10  2.906 ±  0.172   s/op

Key Takeaways:

For single API calls with a response time under 9ms (4ms base delay + 5ms extra delay), platform threads perform better.
When making multiple calls, virtual threads show more consistent performance around 3 seconds.
Platform threads might be preferable for 2 calls if the response time of each stays below 6ms (4ms base delay + 2ms extra delay).

Remember, these are general guidelines. The optimal choice depends on your specific application's characteristics.

Virtual threads