Introduction to Benchmarking in Julia
This post highlights my experience as a beginner and hopefully will show how others can get started in learning to optimize their Julia code.
For the number of years I’ve been programming in Julia, I’ve never really been concerned with performance. That is to say, I’ve appreciated that other people are interested in performance and have proven that Julia can be as fast as any other high-performance language out there. But I’ve never been one to pore over the Performance Tips section of the Julia manual trying to squeeze out every last bit of performance.
But now that I’ve released OmniSci.jl, and as a company one of our major selling points is accelerated analytics, I figured it was time to stop assuming I wrote decent-ish code and pay attention to performance. This post highlights my experience as a beginner and hopefully will show how others can get started in learning to optimize their Julia code.
Read The Manuals!
As I mentioned above, I’ve written Julia for many years now, and in that time I’ve grown up with many of the tips in the Performance Tips section of the documentation. Things like “write type-stable functions” and “avoid global variables” are things that I’ve internalized as good programming practices, as opposed to doing them just because they are performant. But with this long familiarity with the language comes laziness, and by not reading the BenchmarkTools.jl documentation, I started off benchmarking incorrectly. Consider this example:
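A minimal sketch of this kind of comparison, assuming plain string conversion as a stand-in for the OmniSci.jl TStringValue constructor and a hypothetical helper named preallocate (neither is the library's actual code):

```julia
using BenchmarkTools

# One million random integers, held in a non-constant global variable
int64_10x6 = rand(Int64, 10^6)

# Hypothetical helper: allocate the full result vector once, then fill it
function preallocate(x)
    v = Vector{String}(undef, length(x))
    for i in eachindex(v, x)
        v[i] = string(x[i])   # stand-in for the real conversion
    end
    return v
end

# Naive benchmarks: int64_10x6 is referenced as a global, not interpolated
b_broadcast   = @benchmark string.(int64_10x6)
b_preallocate = @benchmark preallocate(int64_10x6)

display(b_broadcast)
display(b_preallocate)
```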
The benchmark above tests whether it’s worth pre-allocating the results array vs. using the more convenient dot broadcasting syntax. The idea is that growing an array over and over can be inefficient when you know the result size at the outset. Yet, comparing the times above, pre-allocating the array is slightly worse on every statistic, even though we’re passing the compiler more knowledge upfront. This didn’t sit well with me, so I consulted the BenchmarkTools.jl manual and found the following about variable interpolation:
A good rule of thumb is that external variables should be explicitly interpolated into the benchmark expression
Interpolating the int64_10x6 input array into the benchmark expression with $ takes it from being a global variable to a local, and sure enough, we see roughly a 6% improvement in the minimum time when we pre-allocate the array.
Whether that 6% improvement will hold up over time or not, at least conceptually we’re no longer worse off for pre-allocating, which fits my mental model of how things should work.
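As a small illustration of that rule of thumb, the sketch below times the same call with and without $ interpolation. It again assumes string conversion as a stand-in for the real workload, and uses @belapsed, the BenchmarkTools macro that returns the minimum time in seconds.

```julia
using BenchmarkTools

x = rand(Int64, 10^6)

# Without interpolation: x is an untyped global inside the benchmark
t_global = @belapsed string.(x)

# With interpolation: the value of x is spliced into the benchmark expression
t_local = @belapsed string.($x)

println("global:       ", round(t_global * 1e3; digits = 2), " ms")
println("interpolated: ", round(t_local * 1e3; digits = 2), " ms")
```

How large the gap is depends on the workload: for a call this expensive, the global lookup is a small share of the total, while for very cheap calls it can dominate the measurement.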
Evaluate Your Benchmark Over the Range of Inputs You Care About
In the comparison above, I evaluate the benchmark over 1 million observations. How did I choose 1 million as the “right” number of events to test, instead of just testing 1 or 10 events? My general goal for benchmarking this code is to speed up the methods of loading data into an OmniSciDB database.
TStringValue is one of the internal methods used in a row-wise table load, converting whatever data is present in an array or DataFrame from its native type to String (think of iterating over a text file line by line). Since users trying to accelerate their database operations are probably going to be using millions to billions of data points, I’m interested in understanding how the functions perform at these volumes of data.
The other conscious decision I made was the environment to test on. I could test this on massive CPU- and GPU-enabled servers, but I’m testing this on my Dell XPS 15 laptop. Why? Because I’m interested in how things are performing under more real-world conditions for a realistic user. Testing the performance characteristics of a high-end server with tons of memory and cores would be fun, but I want to make sure any performance improvements are broadly applicable, instead of just because I am throwing more hardware at the problem.
I was less concerned with controlling for garbage collection, using a fresh session before each measurement, or other “best-case scenario” optimizations. I would expect my users to be more analytics and data science-focused, so re-using the same session is going to be common. If the performance improvements aren’t completely obvious, I’m not going to incorporate them into the codebase.
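To make the idea of benchmarking over a range of input sizes concrete, here is a rough sketch that times one conversion at several sizes. The sizes and the use of string conversion as the workload are illustrative assumptions, not the test plan used in this post.

```julia
using BenchmarkTools

sizes = [10^3, 10^4, 10^5, 10^6]   # illustrative sizes only
timings = Dict{Int,Float64}()

for n in sizes
    x = rand(Int64, n)
    # $x interpolates the loop-local array; seconds=1 caps the time budget per size
    timings[n] = @belapsed string.($x) seconds=1
end

for n in sizes
    println(n, " elements: ", round(timings[n] * 1e3; digits = 3), " ms")
end
```

Looking at how the time grows from one size to the next is a quick way to check whether a result seen at one volume is likely to hold at the volumes users actually work with.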
Case Study: Speeding Up TStringValue
For my test, I evaluate the following as the methods to benchmark:
- Broadcasting: current library default.
- Pre-allocating result array.
- Pre-allocated result array with @inbounds.
- Pre-allocated result array with threads.
- Pre-allocated result array with threads and @inbounds.
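The five variants above could be sketched roughly as follows. The StringValue struct is a hypothetical stand-in for OmniSci.jl's TStringValue, and the function names are made up for illustration; the actual benchmark code is in the gist linked at the end of the post.

```julia
using Base.Threads

# Hypothetical stand-in for the library's TStringValue type
struct StringValue
    str_val::String
    is_null::Bool
end
StringValue(x) = StringValue(string(x), false)

# 1. Broadcasting
broadcasted(x) = StringValue.(x)

# 2. Pre-allocating the result array
function preallocate(x)
    v = Vector{StringValue}(undef, length(x))
    for i in eachindex(v, x)
        v[i] = StringValue(x[i])
    end
    return v
end

# 3. Pre-allocated result array with @inbounds
function preallocate_inbounds(x)
    v = Vector{StringValue}(undef, length(x))
    @inbounds for i in eachindex(v, x)
        v[i] = StringValue(x[i])
    end
    return v
end

# 4. Pre-allocated result array with threads
function preallocate_threads(x)
    v = Vector{StringValue}(undef, length(x))
    @threads for i in eachindex(v, x)
        v[i] = StringValue(x[i])   # each iteration writes a distinct slot
    end
    return v
end

# 5. Pre-allocated result array with threads and @inbounds
function preallocate_threads_inbounds(x)
    v = Vector{StringValue}(undef, length(x))
    @threads for i in eachindex(v, x)
        @inbounds v[i] = StringValue(x[i])
    end
    return v
end

x = rand(Int64, 10^5)
println("threads available: ", nthreads())
println("elements converted: ", length(preallocate_threads(x)))
```

The threaded versions only run in parallel if Julia was started with more than one thread, for example by setting the JULIA_NUM_THREADS environment variable or, on Julia 1.5 and later, passing --threads on the command line.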
The first three, on the left, compare the single-threaded methods. You can see that pre-allocating the output array is marginally faster than broadcasting, and using the @inbounds macro is incrementally faster still, but neither method provides enough speedup to be worth implementing. The difference between the red and the blue bars reflects whether a garbage collection occurred, but again, the three methods aren’t different enough to notice anything interesting.
For the multi-threaded tests, I’m using 6 threads (one per physical core), and we’re seeing roughly a 3x speedup. Like the single-threaded tests above, using @inbounds is only marginally faster, but not by enough to justify implementing it widely at the cost of increased code complexity. Interestingly, these multi-threaded benchmarks didn’t trigger garbage collection at all across my five iterations; I’m not sure whether this is due to threading or not, but it’s something to explore outside of this blog post.
To see how these calculation methods might change at a larger scale, I bumped up the observations by an order of magnitude and saw the following results:
As at the 1 million data range, there isn’t much difference between the three single-threaded methods. All three of them are within a few percent of each other in either direction (all three methods triggered garbage collection in each of their five runs).
For the multi-threaded tests, an interesting performance scenario emerged. As in the 1 million point tests, it’s possible to get a run where garbage collection isn’t triggered, which leads to a large min/median difference in the multi-threaded tests. If you can avoid garbage collection, using six threads here gives nearly a 10x speedup, and at the median, where both the single-threaded and multi-threaded runs trigger garbage collection, you still get a 2x speedup.
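The min/median gap described above can be inspected directly on a BenchmarkTools trial, which records garbage collection time for each sample. The sketch below assumes string conversion as a stand-in workload and asks for five samples to mirror the five iterations mentioned earlier; the numbers it prints will vary by machine and session.

```julia
using BenchmarkTools, Statistics

x = rand(Int64, 10^6)

# Up to five samples, one evaluation each
trial = @benchmark string.($x) samples=5 evals=1

fastest = minimum(trial)   # best case, often a run with little or no GC
typical = median(trial)    # more representative of repeated use

# Times are reported in nanoseconds
println("min:    ", round(fastest.time / 1e6; digits = 2), " ms, GC ",
        round(fastest.gctime / 1e6; digits = 2), " ms")
println("median: ", round(typical.time / 1e6; digits = 2), " ms, GC ",
        round(typical.gctime / 1e6; digits = 2), " ms")

# How many of the samples spent any time in garbage collection
println("samples with GC: ", count(>(0), trial.gctimes), " of ", length(trial.gctimes))
```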
Parallelism > Compiler Hinting
In the case study above, I’ve demonstrated that for this problem, threading is the first avenue to pursue for speeding up the OmniSci.jl load table methods. While pre-allocating the output array and using @inbounds did show some slight speedups, using threads to perform the calculations is where the largest improvements occurred.
The pre-allocation step falls naturally out of the way I wrote the threading methods, so I’ll incorporate that too. Disabling bounds-checking on arrays using @inbounds seems more dangerous than it is worth, even though none of these methods should ever get outside of their bounds.
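A small illustration of what bounds checking buys, using a deliberately buggy, hypothetical function that is not part of OmniSci.jl. With checks enabled, the off-by-one loop fails loudly instead of silently reading past the end of the array:

```julia
# Deliberately buggy loop: runs one index past the end of x
function sum_off_by_one(x)
    total = zero(eltype(x))
    for i in 1:length(x) + 1
        total += x[i]
    end
    return total
end

# Capture the error instead of letting it stop the script
err = try
    sum_off_by_one([1, 2, 3])
    nothing
catch e
    e
end

println("caught a BoundsError: ", err isa BoundsError)
```

Wrapping the same loop in @inbounds would remove that check, and the out-of-range read would become undefined behavior rather than an error, which is the risk being weighed against a marginal speedup.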
Overall, I hope this post has demonstrated that you don’t have to fancy yourself a high-frequency trader or a bit-twiddler to find ways to improve your Julia code. The first step is reading the manuals for benchmarking, and then like any other pursuit, the only way to get a feeling for what works is to try things.
All of the code for this blog post can be found in this GitHub gist.
Published at DZone with permission of Randy Zwitch. See the original article here.