Introduction
I have been working on bindings for Liburing. To get the performance to a reasonable level I had to use a few tricks that I want to share. In a nutshell:
- You want to call the C code as little as possible.
- You want to avoid copying bytes between Java and C.
- You want to manage memory yourself.
Before you try these tips in your own application, it helps to measure a baseline of its current performance, so you can compare it with the performance after making the changes.
Let's cover these topics from top to bottom and see how you can use them in your performance journey!
⚠️ WARNING
JMH benchmarks can give you a skewed view of which code is actually faster. To get reliable results you need reliable benchmarks with real-world data. If a benchmark doesn't match the real workload, it is hard to know whether one piece of code is faster than another.
The findings in this post came from my Liburing bindings project, so these tips work great for my workload and use cases, and I hope they will help you too. When you copy these practices, please test and verify that your code still behaves as expected and is actually faster.
Prevent needless copying between Java and C
The Foreign Function & Memory API makes it incredibly easy to create bindings to C code. If you are using Jextract it gets even easier. These are great tools, but they don't know your specific use case or the call pattern of your C code. There are some easy wins here. What we are looking for is code with the following pattern:
- Calls a C method with a return value
- The return value is turned into a String, object, etc.
- Another C method needs the previous return value
- Using arena.allocate, this value is copied back into native memory.
Instead of translating the return value into something that Java understands, it is better to pass the address or memory segment straight to the next C down call.
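To make the pattern concrete, here is a minimal sketch, assuming Java 22+ on 64-bit Linux or macOS. It uses libc's strchr and strlen purely as stand-ins for "a C method that returns a pointer" and "a C method that needs that pointer"; the class and method names are hypothetical and not taken from the Liburing bindings.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical example class; strchr/strlen are stand-ins for your own C functions.
final class PassSegmentDirectly {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup LIBC = LINKER.defaultLookup();

    // char *strchr(const char *s, int c)
    private static final MethodHandle STRCHR = LINKER.downcallHandle(
            LIBC.find("strchr").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.JAVA_INT));

    // size_t strlen(const char *s); size_t is assumed to be 64 bits wide
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LIBC.find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    // Slow: the returned pointer is turned into a Java String and copied back again.
    static long viaJavaString(String input) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment in = arena.allocateFrom(input);
            MemorySegment found = (MemorySegment) STRCHR.invokeExact(in, (int) '=');
            if (found.address() == 0) return -1; // '=' not present
            // The result points inside 'in', so the readable size is known.
            long remaining = in.byteSize() - (found.address() - in.address());
            String asJava = found.reinterpret(remaining).getString(0); // native -> Java copy
            MemorySegment copy = arena.allocateFrom(asJava);           // Java -> native copy
            return (long) STRLEN.invokeExact(copy);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    // Faster: the returned segment goes straight into the next down call.
    static long viaSegment(String input) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment in = arena.allocateFrom(input);
            MemorySegment found = (MemorySegment) STRCHR.invokeExact(in, (int) '=');
            if (found.address() == 0) return -1; // '=' not present
            return (long) STRLEN.invokeExact(found);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaJavaString("key=value")); // prints 6
        System.out.println(viaSegment("key=value"));    // prints 6
    }
}
```

Both methods return the same result; the second one simply skips the String conversion and the extra allocation.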
Making method handles static final
Another tip is to make your method handles static final. In the following example you can see how a library is loaded and two method handles are created.
Using static final method handles can shave off a few nanoseconds.
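As a rough illustration of the idea, the sketch below looks up two functions once and stores the handles in static final fields, which lets the JIT treat them as constants. It assumes Java 22+ and uses libc's strlen and toupper from the default lookup as stand-ins for your own library; the class name is hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical example class; swap the default lookup for your own library lookup.
final class StaticHandles {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup LIBC = LINKER.defaultLookup();

    // Created once during class initialization and never reassigned.
    // size_t strlen(const char *s); size_t is assumed to be 64 bits wide
    private static final MethodHandle STRLEN = LINKER.downcallHandle(
            LIBC.find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

    // int toupper(int c)
    private static final MethodHandle TOUPPER = LINKER.downcallHandle(
            LIBC.find("toupper").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

    static long strlen(String s) {
        try (Arena arena = Arena.ofConfined()) {
            return (long) STRLEN.invokeExact(arena.allocateFrom(s));
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    static char toUpper(char c) {
        try {
            int upper = (int) TOUPPER.invokeExact((int) c);
            return (char) upper;
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(strlen("io_uring"));
        System.out.println(toUpper('a'));
    }
}
```

The thing to avoid is creating the handle inside the method or keeping it in a non-final field, since the lookup and linking work is then repeated or the call cannot be optimized as well.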
Creating a wrapper
A down call is quite fast, but there is still some handover between Java and C when a call is made. Return values need to be passed to Java, and parameters must be passed down to C. This isn't the biggest time saver, but it could save you some microseconds. To perform fewer down calls and to simplify the code (you need fewer down call handles), you can create a wrapper. You pass all the arguments in one go and the wrapper runs multiple methods.
In the following example, instead of doing multiple down calls, you only need one that runs multiple C methods.
If you did this in Java you would need 3 down calls and 1 memory allocation, which takes time. With a wrapper you only need a single down call.
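A real wrapper would be a small C function that you write and compile yourself, which cannot be shown in a self-contained Java snippet. As a stand-in, the sketch below uses libc's strdup, which does the work of strlen, malloc and memcpy in one call, to show the same trade: three down calls versus one. It assumes Java 22+ on 64-bit Linux or macOS; the class and method names are hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical example class; strdup plays the role of a hand-written C wrapper.
final class FewerDowncalls {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup LIBC = LINKER.defaultLookup();

    private static MethodHandle handle(String name, FunctionDescriptor fd) {
        return LINKER.downcallHandle(LIBC.find(name).orElseThrow(), fd);
    }

    // size_t is assumed to be 64 bits wide in all descriptors below.
    private static final MethodHandle STRLEN = handle("strlen",
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));
    private static final MethodHandle MALLOC = handle("malloc",
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
    private static final MethodHandle MEMCPY = handle("memcpy",
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.ADDRESS,
                    ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));
    private static final MethodHandle STRDUP = handle("strdup",
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.ADDRESS));
    private static final MethodHandle FREE = handle("free",
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    // Duplicates a native string and reads the duplicate back for verification.
    static String duplicate(String text, boolean singleCall) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment src = arena.allocateFrom(text);
            MemorySegment dst;
            if (singleCall) {
                // One down call: the C side does length, allocation and copy.
                dst = (MemorySegment) STRDUP.invokeExact(src);
            } else {
                // Three down calls, each with its own Java-to-C handover.
                long size = (long) STRLEN.invokeExact(src);
                size += 1; // room for the terminating zero byte
                dst = (MemorySegment) MALLOC.invokeExact(size);
                if (dst.address() == 0) throw new OutOfMemoryError("malloc failed");
                MemorySegment ignored = (MemorySegment) MEMCPY.invokeExact(dst, src, size);
            }
            if (dst.address() == 0) throw new OutOfMemoryError("strdup failed");
            try {
                return dst.reinterpret(src.byteSize()).getString(0);
            } finally {
                FREE.invokeExact(dst);
            }
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(duplicate("io_uring", false));
        System.out.println(duplicate("io_uring", true));
    }
}
```

With your own wrapper the Java side looks like the single-call branch: one handle, one invocation, and all arguments passed in one go.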
Managing memory yourself
When you use arena.allocate() you get a MemorySegment that is filled with zeros, as if segment.fill((byte) 0) had been called on it. While this gives you predictable memory segments, it's a costly operation.
You get zeroed out memory when you create a MemorySegment like this:
Memory is allocated and zeroed at the start and freed when exiting the try block. If you don't need zeroed-out memory you can also use malloc and free. This will allocate native memory without zeroing it out.
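The two approaches can be sketched side by side as follows, assuming Java 22+ on a 64-bit platform where the default lookup exposes libc's malloc and free. The class and method names are hypothetical, and the 1024-byte size is only an example.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Hypothetical example class comparing arena allocation with malloc/free.
final class MallocVsArena {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup LIBC = LINKER.defaultLookup();

    // void *malloc(size_t size); size_t is assumed to be 64 bits wide
    private static final MethodHandle MALLOC = LINKER.downcallHandle(
            LIBC.find("malloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));

    // void free(void *ptr)
    private static final MethodHandle FREE = LINKER.downcallHandle(
            LIBC.find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    // Arena: memory is zeroed on allocation and freed when the try block exits.
    static long viaArena(long value) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(1024, 8);
            segment.set(ValueLayout.JAVA_LONG, 0, value);
            return segment.get(ValueLayout.JAVA_LONG, 0);
        }
    }

    // malloc/free: memory is NOT zeroed and must be freed by hand.
    static long viaMalloc(long value) {
        try {
            MemorySegment pointer = (MemorySegment) MALLOC.invokeExact(1024L);
            if (pointer.address() == 0) throw new OutOfMemoryError("malloc failed");
            try {
                // malloc returns a zero-length segment; give it its real size.
                MemorySegment segment = pointer.reinterpret(1024);
                segment.set(ValueLayout.JAVA_LONG, 0, value);
                return segment.get(ValueLayout.JAVA_LONG, 0);
            } finally {
                FREE.invokeExact(pointer);
            }
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaArena(42L));
        System.out.println(viaMalloc(42L));
    }
}
```

Note that with malloc the segment contains whatever bytes were in that memory before, so only read what you have written yourself.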
When you allocate 1024 bytes or fewer, arena.allocate is only about twice as slow as malloc. The more you allocate, the slower it gets, as you can see in the following output.
As you can see, malloc and free stay steady at around 24 and 55 ns. Using an arena takes more time the more bytes you need, going all the way up to 1902 ns to allocate 65536 bytes, while malloc only needed 52 ns.
So when to use arena.allocate or malloc?
Arena positives:
- Memory is managed for you
- Is part of the JDK
- Memory is zeroed at the start
Arena downsides:
- Slow (if you don't need zeroed memory)
- You cannot free a single allocation yourself, only the entire arena.
Malloc positives:
- Fast
- Consistent speed
- You can free it when you need to
Malloc downsides:
- You need to manage memory yourself
- Not part of the JDK
Skipping Arena's allocateFrom()
Does that mean you should use malloc everywhere? Definitely not. Take a look at the following result of allocating and copying a String into native memory. Arena is part of the JDK, so it has access to fields and classes that we developers don't have access to. That means we need to put in more effort to achieve the same result.
While malloc is faster than arena.allocate, all the extra effort of copying a String yourself makes it slower overall.
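To show where the extra effort comes from, here is a minimal sketch of both ways to get a String into native memory, assuming Java 22+ and libc's malloc and free from the default lookup. The manual variant is one possible way to do the copy, not necessarily the one that was benchmarked; the class and method names are hypothetical.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.SymbolLookup;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.charset.StandardCharsets;

// Hypothetical example class; both methods read the string back for verification.
final class StringToNative {
    private static final Linker LINKER = Linker.nativeLinker();
    private static final SymbolLookup LIBC = LINKER.defaultLookup();

    // void *malloc(size_t size); size_t is assumed to be 64 bits wide
    private static final MethodHandle MALLOC = LINKER.downcallHandle(
            LIBC.find("malloc").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.ADDRESS, ValueLayout.JAVA_LONG));

    // void free(void *ptr)
    private static final MethodHandle FREE = LINKER.downcallHandle(
            LIBC.find("free").orElseThrow(),
            FunctionDescriptor.ofVoid(ValueLayout.ADDRESS));

    // One call: the JDK encodes, allocates, copies and adds the terminator.
    static String viaArena(String text) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocateFrom(text);
            return segment.getString(0);
        }
    }

    // Manual route: encode to a byte[], malloc, copy, terminate, free.
    static String viaMalloc(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        long size = bytes.length + 1L; // room for the terminating zero byte
        try {
            MemorySegment pointer = (MemorySegment) MALLOC.invokeExact(size);
            if (pointer.address() == 0) throw new OutOfMemoryError("malloc failed");
            try {
                MemorySegment segment = pointer.reinterpret(size);
                MemorySegment.copy(bytes, 0, segment, ValueLayout.JAVA_BYTE, 0, bytes.length);
                // malloc does not zero memory, so the terminator is written explicitly.
                segment.set(ValueLayout.JAVA_BYTE, bytes.length, (byte) 0);
                return segment.getString(0);
            } finally {
                FREE.invokeExact(pointer);
            }
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaArena("io_uring"));
        System.out.println(viaMalloc("io_uring"));
    }
}
```

The manual route needs an intermediate byte[] and an explicit terminator, which is the extra work that the arena version handles internally.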
Conclusion
These are some tricks I used when creating bindings for Liburing that needed to outperform FileChannels. In this post we looked at how preventing useless copying and translating between Java and C, creating wrappers to limit the number of down calls, and managing memory allocation yourself can really save time.
Got another tip or trick? Or did something not work for you? Feel free to reach out!
Source
Here you can find the source code for this post: https://github.com/davidtos/BenchmarkUnsafeVsPanama