
Improve performance of Foreign memory and function bindings in Java

Written by: David Vlijmincx

Introduction

I have been working on creating bindings for Liburing. To get the performance to a reasonable level I had to use a few tricks that I want to share. In a nutshell:

  • You want to call the C code as little as possible.
  • You want to prevent copying bytes between Java and native memory.
  • You want to manage memory yourself.

Before you start trying out these tips in your own application, it is handy to have a baseline of the current performance, so you can compare it to the application's performance after making the changes.

Let's cover these topics from top to bottom, and see how you can use them in your performance journey!

⚠️ WARNING

Running JMH benchmarks can give you a skewed view of which code is actually faster. To get reliable results you need reliable benchmarks with real-world data. If a benchmark doesn't match the real workload, it is hard to know whether one piece of code is faster than another.

The findings in this post came from my Liburing bindings project, so these tips work great for my workload and use cases, and I hope they will help you too. When you copy these practices, please test and verify that your code still behaves as expected and is actually faster.
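
For such a baseline, a minimal JMH skeleton could look like the one below. This is my own addition, not code from the bindings project; the class name and the benchmark body are placeholders you would replace with a call into your own bindings, fed with realistic input data.

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
public class BaselineBenchmark {

    @Benchmark
    public long callBinding() {
        // replace this with a call into your own bindings, using realistic input data
        return System.nanoTime();
    }
}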

Prevent needless copying between Java and C

The foreign memory and function API makes it incredibly easy to create bindings to C code. If you are using Jextract it gets even easier. These are great tools, but they don't know what your specific use case is or what the call pattern of your C code is. We have some easy wins here. What we are looking for is code with the following pattern:

  • Java calls a C function that has a return value.
  • The return value is turned into a String, object, etc.
  • Another C function needs that previous return value.
  • Using arena.allocate, this value is copied back into native memory.

Instead of translating the value into something that Java knows, it is better to just pass the address or memory segment straight to the next C down call, as in the sketch below.
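
As an illustration, compare the two approaches in this sketch. It is not code from the Liburing bindings; getName and useName stand in for two down call handles where the result of the first call is needed by the second, arena is an open Arena, and exception handling for invokeExact is omitted for brevity.

// Wasteful: copy the native result into a Java String, then copy it back into native memory.
MemorySegment namePtr = (MemorySegment) getName.invokeExact();      // down call 1
String name = namePtr.reinterpret(128).getString(0);                // copy C -> Java
MemorySegment nameCopy = arena.allocateFrom(name);                  // copy Java -> C
int result = (int) useName.invokeExact(nameCopy);                   // down call 2

// Better: hand the segment returned by the first call straight to the second call.
MemorySegment ptr = (MemorySegment) getName.invokeExact();
int result2 = (int) useName.invokeExact(ptr);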

Making method handles static final

Another tip is to make your method handles static final. In the following example you can see how a library is loaded and two method handles are created.

// assumes: import java.lang.foreign.*; import java.lang.invoke.MethodHandle;
// and the static imports of ValueLayout.ADDRESS and ValueLayout.JAVA_LONG
private static final MethodHandle malloc;
private static final MethodHandle free;

static {
    SymbolLookup SYMBOL_LOOKUP = SymbolLookup.libraryLookup("path/to/a/library", Arena.global());
    Linker LINKER = Linker.nativeLinker();
    // you could also use the default lookup of the nativeLinker
    malloc = LINKER.downcallHandle(
            SYMBOL_LOOKUP.find("malloc").orElseThrow(),
            FunctionDescriptor.of(ADDRESS, JAVA_LONG)   // malloc(size_t) returns a pointer
    );

    free = LINKER.downcallHandle(
            SYMBOL_LOOKUP.find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ADDRESS)          // free(void*) returns nothing
    );
}

Using static final method handles can shave off a few nanoseconds, because the JIT can treat a static final MethodHandle as a constant and inline the down call.
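
As a usage sketch (my own addition; the helper names are made up), this is how the two static final handles above could be invoked. invokeExact requires the casts and argument types to match the function descriptors exactly.

static MemorySegment allocateNative(long size) {
    try {
        // matches FunctionDescriptor.of(ADDRESS, JAVA_LONG): long in, MemorySegment out
        return (MemorySegment) malloc.invokeExact(size);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}

static void freeNative(MemorySegment address) {
    try {
        // matches FunctionDescriptor.ofVoid(ADDRESS): MemorySegment in, nothing out
        free.invokeExact(address);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}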

Creating a wrapper

A down call is quite fast, but there is still some handover between Java and C when a call is made. Return values need to be passed up to Java, and parameters must be passed down to C. This isn't the biggest time saver, but it could save you some microseconds. To perform fewer down calls and to make the code easier (you need fewer down call handles), you can create a wrapper in C. You pass all the arguments in one go and the wrapper runs multiple C functions.

Like in the following example: instead of doing multiple down calls, you only need one that runs multiple C functions.

#include <stdlib.h>
#include <liburing.h>

// Allocates the buffer and prepares the read in C, so Java only needs one down call.
int *read_with_offset_buffer(struct io_uring *ring, int fd, int size, void *user_data, int offset) {
    int *buff = malloc(size);
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buff, size, offset);
    io_uring_sqe_set_data(sqe, user_data);
    return buff;
}

If you did this from Java, you would need three down calls and one memory allocation, which takes time. By creating a wrapper you only need a single down call.
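
Assuming the wrapper above is compiled into a shared library, the Java side only needs one down call handle for it. The sketch below is my own addition: the library path is a placeholder and it uses the same static imports of ADDRESS and JAVA_INT as the earlier example.

// One handle for the whole wrapper instead of three separate down calls.
// int* read_with_offset_buffer(struct io_uring *ring, int fd, int size, void *user_data, int offset)
private static final MethodHandle READ_WITH_OFFSET_BUFFER = Linker.nativeLinker().downcallHandle(
        SymbolLookup.libraryLookup("path/to/wrapper/library", Arena.global())
                .find("read_with_offset_buffer").orElseThrow(),
        FunctionDescriptor.of(ADDRESS, ADDRESS, JAVA_INT, JAVA_INT, ADDRESS, JAVA_INT)
);

static MemorySegment readWithOffset(MemorySegment ring, int fd, int size, MemorySegment userData, int offset) {
    try {
        // a single down call that allocates the buffer and prepares the read on the C side
        return (MemorySegment) READ_WITH_OFFSET_BUFFER.invokeExact(ring, fd, size, userData, offset);
    } catch (Throwable t) {
        throw new RuntimeException(t);
    }
}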

Managing memory yourself

When you use arena.allocate() you get a MemorySegment that is filled with zeros (the equivalent of segment.fill((byte) 0)). While this gives you predictable memory segments, it is a costly operation. You get zeroed-out memory when you create a MemorySegment like this:

try (var arena = Arena.ofConfined()) {
    MemorySegment segment = arena.allocate(ValueLayout.JAVA_BYTE, 512);
    // ... your code
}

Memory is allocated and zeroed at the start of the try block and freed when exiting it. If you don't need zeroed-out memory, you can also use malloc and free. This will allocate native memory without zeroing it out, as shown in the sketch below.
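
A minimal sketch of that, using the allocateNative and freeNative helpers sketched earlier (again my own addition): the segment returned by a down call has length zero, so it has to be resized with reinterpret before you can read or write it, and nothing frees it for you. Note that reinterpret is a restricted method, so you may need to enable native access for your module.

MemorySegment raw = allocateNative(512);        // calls malloc; the memory is NOT zeroed
MemorySegment buffer = raw.reinterpret(512);    // give the zero-length segment its real size
try {
    buffer.set(ValueLayout.JAVA_INT, 0, 42);    // use it like any other segment
} finally {
    freeNative(raw);                            // you have to free it yourself
}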

When you allocate 1024 bytes or fewer, arena.allocate is only about twice as slow. The more you allocate, the slower it gets, as you can see in the following output.

Benchmark                                (size)  Mode  Cnt     Score    Error  Units
BenchMarkMemoryAllocation.MallocAndFree       2  avgt    5    23.431 ±  1.062  ns/op
BenchMarkMemoryAllocation.MallocAndFree       4  avgt    5    23.669 ±  2.581  ns/op
BenchMarkMemoryAllocation.MallocAndFree       8  avgt    5    24.221 ±  0.333  ns/op
BenchMarkMemoryAllocation.MallocAndFree      16  avgt    5    23.637 ±  1.497  ns/op
BenchMarkMemoryAllocation.MallocAndFree      32  avgt    5    24.068 ±  1.392  ns/op
BenchMarkMemoryAllocation.MallocAndFree      64  avgt    5    23.842 ±  0.925  ns/op
BenchMarkMemoryAllocation.MallocAndFree     128  avgt    5    24.797 ±  0.452  ns/op
BenchMarkMemoryAllocation.MallocAndFree     256  avgt    5    23.932 ±  1.000  ns/op
BenchMarkMemoryAllocation.MallocAndFree     512  avgt    5    23.952 ±  1.194  ns/op
BenchMarkMemoryAllocation.MallocAndFree    1024  avgt    5    24.046 ±  0.259  ns/op
BenchMarkMemoryAllocation.MallocAndFree    2048  avgt    5    56.243 ±  2.886  ns/op
BenchMarkMemoryAllocation.MallocAndFree    4096  avgt    5    55.448 ±  1.906  ns/op
BenchMarkMemoryAllocation.MallocAndFree    8192  avgt    5    52.743 ±  2.089  ns/op
BenchMarkMemoryAllocation.MallocAndFree   16384  avgt    5    51.991 ±  2.742  ns/op
BenchMarkMemoryAllocation.MallocAndFree   32768  avgt    5    51.423 ±  1.909  ns/op
BenchMarkMemoryAllocation.MallocAndFree   65536  avgt    5    52.887 ±  1.960  ns/op
BenchMarkMemoryAllocation.arenaAllocate       2  avgt    5    42.933 ±  1.649  ns/op
BenchMarkMemoryAllocation.arenaAllocate       4  avgt    5    42.678 ±  0.849  ns/op
BenchMarkMemoryAllocation.arenaAllocate       8  avgt    5    54.939 ±  4.880  ns/op
BenchMarkMemoryAllocation.arenaAllocate      16  avgt    5    43.455 ±  1.653  ns/op
BenchMarkMemoryAllocation.arenaAllocate      32  avgt    5    53.363 ±  1.619  ns/op
BenchMarkMemoryAllocation.arenaAllocate      64  avgt    5    56.485 ±  2.288  ns/op
BenchMarkMemoryAllocation.arenaAllocate     128  avgt    5    44.451 ±  0.433  ns/op
BenchMarkMemoryAllocation.arenaAllocate     256  avgt    5    48.916 ±  3.241  ns/op
BenchMarkMemoryAllocation.arenaAllocate     512  avgt    5    55.695 ±  0.766  ns/op
BenchMarkMemoryAllocation.arenaAllocate    1024  avgt    5    68.679 ±  1.587  ns/op
BenchMarkMemoryAllocation.arenaAllocate    2048  avgt    5   146.989 ±  3.909  ns/op
BenchMarkMemoryAllocation.arenaAllocate    4096  avgt    5   195.228 ±  9.055  ns/op
BenchMarkMemoryAllocation.arenaAllocate    8192  avgt    5   319.040 ± 17.758  ns/op
BenchMarkMemoryAllocation.arenaAllocate   16384  avgt    5   571.871 ± 33.375  ns/op
BenchMarkMemoryAllocation.arenaAllocate   32768  avgt    5  1023.374 ± 33.790  ns/op
BenchMarkMemoryAllocation.arenaAllocate   65536  avgt    5  1902.465 ±  3.586  ns/op

As you can see, malloc and free stay steady: around 24 ns for allocations up to 1024 bytes and around 52 to 56 ns for larger ones. Using an arena takes more time the more bytes you need, going all the way up to 1902 ns to allocate 65536 bytes, while malloc only needed about 53 ns.

So when to use arena.allocate or malloc?

Arena positives:

  • Memory is managed for you
  • Is part of the JDK
  • Memory is zeroed at the start

Arena downsides:

  • Slow (if you don't need zeroed memory)
  • You cannot free a single allocation yourself, only the entire arena.

Malloc positives:

  • Fast
  • Consistent speed
  • You can free it when you need to

Malloc downsides:

  • You need to manage memory yourself
  • Not part of the JDK

Skipping Arena's allocateFrom()

Does that mean you should use malloc everywhere? Definitely not. Take a look at the following result of allocating and copying a String into native memory. Arena is part of the JDK, so it has access to fields and classes we developers don't have access to, meaning we need to put in more effort to achieve the same result.

Benchmark                                      Mode  Cnt   Score   Error  Units
StringAllocationBenchmark.MallocString         avgt    5  43.322 ± 3.282  ns/op
StringAllocationBenchmark.arenaAllocateString  avgt    5  41.110 ± 0.531  ns/op

While malloc itself is faster than arena.allocate, all the extra work of copying the String into native memory yourself makes the malloc-based approach slower overall.
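
For completeness, this is the arena route the benchmark refers to: allocateFrom copies the Java String into native memory as a NUL-terminated C string in a single call, so there is no extra work left on your side.

try (var arena = Arena.ofConfined()) {
    // copies the String into native memory as NUL-terminated UTF-8 in one call
    MemorySegment cString = arena.allocateFrom("hello from Java");
    // pass cString to a down call that expects a char*
}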

Conclusion

These are some of the tricks I used when creating bindings for Liburing, where I needed to outperform FileChannel. In this post we looked at how preventing useless copying and translating between Java and C, creating wrappers to limit the number of down calls, and managing memory allocation yourself can really save time.

Got another tip or trick? Or did something not work for you? Feel free to reach out!

Source

Here you can find the source code for this post: https://github.com/davidtos/BenchmarkUnsafeVsPanama