Gandiva Initiative: Adding a User Defined Function to Gandiva
Learn how to add user-defined functions to Gandiva, the new LLVM-based execution kernel for Arrow, for optimized compilation of expressions.
Join the DZone community and get the full member experience.Join For Free
You’re probably already aware of the recently announced Gandiva Initiative for Apache Arrow, but for those who need a refresher, this is the new execution kernel for Arrow that is based on LLVM. Gandiva provides very significant performance improvements for low-level operations on Arrow buffers. We first included this work in Dremio to improve the efficiency and performance of analytical workloads on our platform, which will become available to users later this year. In this post we will provide a brief overview for how you would develop a simple function in Gandiva as well as how to submit it to the Arrow community.
Fundamentally, Gandiva uses LLVM to do just-in-time compilation of expressions. The dynamic part of the LLVM IR code is generated using an IRBuilder, and the static part is generated at compile time using clang. At run-time, both the parts are combined into a single module and optimized together. For most new UDFs, adding a hook in the static IR generation technique is sufficient. More details about Gandiva layering and optimizations can be found here.
The functions supported in Gandiva are classified into one of three categories based on how they treat null values. Gandiva uses this information during code-generation to reduce branch instructions, and thereby, increasing CPU pipelining. The three categories are as follows:
1. NULL_IF_NULL Category
In this category, the result of the function is null if and only if one or more of the input parameters are null. Most arithmetic and logical functions come under this category.
For these functions, the Gandiva layer does all of the null handling. The actual function definition can simply ignore nulls.
2. NULL_NEVER Category
In this category, the result of the function cannot be null i.e the result is non-nullable. But, the result value depends on the validity of the input parameters. The function prototype includes both the value and the validity of each input parameter.
3. NULL_INTERNAL Category
In this category, the result of the function may be null based on some internal logic that depends on the value of the internal values/validity. The function prototype includes both the value and the validity of each input parameter, and a pointer for the result validity (bool).
Steps for Adding a UDF in Gandiva
Ok, now we’ll get into the details explaining the steps required for adding a UDF in Gandiva. We’ll show a simple example using the NULL_IF_NULL category function (as previously described) that returns the average of two integers.
1. Download and Build Gandiva
Clone the Gandiva git repository and build it on your test machine or desktop. Please follow the instructions here.
2. Add the Code for the New Function
We will add our simple function to the existing arithmetic_ops.cc. For more complex functions or types it’s better to add to a new file.
3. Add Function Details in the Function Registry
The function registry includes the details of all supported functions in Gandiva. Add this line to the pc_registry_ array in function_registry.cc
This registers our simple function with:
- External name as my_average
- Takes two input parameters: both of type int32
- Returns output parameter of type int32
- Function category NULL_IF_NULL
- Implemented in function my_average_int32_int32
4. Add Unit Tests
For this simple function, we will skip adding a unit test. For complex functions, it’s recommended to add a unit test for the new function.
We’ll add an integ test by building a projector in projector_test.cc that computes the average of two columns.
5. Build Gandiva
Build Gandiva and run the tests.
6. Submit Your Function and Raise a PR Request
First, you must push the changes to your fork and create a PR against an upstream project. This is best done by pushing to your local repo and raising a PR request on the Gandiva page using the diff with your repo.
Once the PR is created, the community can review the code changes and merge.
The full code listing for this simple function is present in this PR — it includes both a projector test and a filter test.
Other Interesting Experiments to Try
The following experiments are not required but are interesting to explore and see what optimizations are being done by the LLVM function passes.
Check Out the Pre-Compiled IR Code
You can look at the pre-compiled code for your newly added function using the
Open the ir.txt file and search for "myaverage." You should see the IR code (unoptimized)
Check Out the Post-Optimized IR Code
First, modify the optimizer function to dump the IR code. The easiest way to do this is by modifying this line (move the call to DumpIR to outside the if condition.)
Build and run the test again.
You should see the optimized IR code on stdout. You’ll notice that the:
- The function has been inlined (no function calls).
- The code has been vectorized.
As this runs the vectorized snippet shows the function processing four integers at a time: adds 4 integers to 4 integers, divides all the 4 integers by 2, and so on…
In this article, we gave an overview of adding a simple function to Gandiva. In subsequent articles, we’ll extend this to functions for other categories and functions using libraries from c++ std or boost.
We also have more features coming that deal with supporting pluggable function repositories and more optimizations (eg. special handling for batches that have no nulls.) More to follow!
Published at DZone with permission of Ravindra Pindikura. See the original article here.
Opinions expressed by DZone contributors are their own.