Building Pyarrow With CUDA Support
In this article, we discuss how to build Python's pyarrow package with CUDA support.
The other day, I was looking to read an Arrow buffer on GPU using Python, but as far as I could tell, none of the provided pyarrow packages on conda or pip are built with CUDA support. Like many of the packages in the compiled-C-wrapped-by-Python ecosystem, Apache Arrow is thoroughly documented, but the number of permutations of how you could choose to build pyarrow with CUDA support quickly becomes overwhelming.
In this post, I’ll show you how to build pyarrow with CUDA support on Ubuntu using Docker and virtualenv. These directions are approximately the same as the official Apache Arrow docs, but here, I explain them step-by-step and show only the single build toolchain I used.
Step 1: Docker With GPU Support
Even though I used Ubuntu 18.04 LTS on a workstation with an NVIDIA GPU, whenever I undertake a project like this, I like to use a Docker container to keep everything isolated. The last thing you want to do is to debug environment errors, changing dependencies for one project and breaking something else. Thankfully, NVIDIA Docker developer images are available via DockerHub:
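A minimal sketch of pulling and starting such a container follows; the exact image tag is an assumption on my part, so substitute whichever `nvidia/cuda` devel tag matches your driver and CUDA version:

```shell
# Start an NVIDIA CUDA development container (the 10.1 / Ubuntu 18.04 tag is illustrative)
docker run -it --gpus=all --rm nvidia/cuda:10.1-devel-ubuntu18.04 /bin/bash
```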
-it flag puts us inside the container at a bash prompt;
--gpus=all allows the Docker container to access my workstation’s GPUs and
--rm deletes the container after we’re done to save space.
Step 2: Setting Up the Ubuntu Docker Container
Containers pulled from DockerHub are frequently bare-bones in terms of the libraries they include, and their package lists are usually out of date. For building pyarrow, it's useful to update the package index and install the following:
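A sketch of the system setup I'd start from; this exact package list is an assumption, and the dependency script in Step 4 will flag anything still missing:

```shell
# Refresh the package index, then install common build tooling
apt update && apt upgrade -y
apt install -y autoconf build-essential cmake git \
    python3-dev python3-pip virtualenv wget
```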
In a later step, we’ll use the Arrow third-party dependency script to ensure all needed dependencies are present, but these are a good start.
Step 3: Cloning Apache Arrow From GitHub
Cloning Arrow from GitHub is pretty straightforward. The git checkout apache-arrow-0.15.0 line is optional; I needed version 0.15.0 for the project I was exploring, but if you want to build from the master branch of Arrow, you can omit that line.
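The clone itself looks like this, with the optional checkout pinning the 0.15.0 release tag mentioned above:

```shell
# Clone Apache Arrow and (optionally) pin to the 0.15.0 release tag
git clone https://github.com/apache/arrow.git
cd arrow
git checkout apache-arrow-0.15.0
```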
Step 4: Installing Remaining Apache Arrow Dependencies
As mentioned in Step 2, some of the dependencies for building Arrow are system-level and can be installed via apt. To ensure that we have all the remaining third-party dependencies, we can use the provided script in the Arrow repository:
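A sketch of invoking that script; the path and the output-capture pattern are assumptions based on the Arrow 0.15 source tree, where the script downloads tarballs into a target directory and prints export statements to stdout:

```shell
# Download Arrow's third-party dependencies and capture the
# environment variables the script prints for the build to pick up
./cpp/thirdparty/download_dependencies.sh $HOME/arrow-thirdparty > arrow-deps.env
source arrow-deps.env
```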
The script downloads all of the necessary libraries and sets environment variables that are picked up later, which is amazingly helpful.
Step 5: Building Apache Arrow C++ Library
Pyarrow links against the Arrow C++ libraries, so they need to be built and installed before we can build the pyarrow wheel:
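A sketch of the C++ build follows. Treat the flag set as an assumption copied from the Arrow docs of that era; for the purposes of this post, -DARROW_CUDA=ON is the essential one:

```shell
# Configure, compile, and install the Arrow C++ libraries with CUDA enabled
mkdir -p cpp/build
cd cpp/build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local \
      -DCMAKE_BUILD_TYPE=release \
      -DARROW_PYTHON=ON \
      -DARROW_CUDA=ON \
      ..
make -j$(nproc)
make install
cd ../..
```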
This is a pretty standard workflow for building a C or C++ library: we create a build directory, call cmake from inside of that directory to set up the options we want to use, then use make and make install to compile and install the library, respectively.
I chose all of the -DARROW_* options above as a straight copy/paste from the Arrow documentation; Arrow doesn't take long to build using these options, but it's possible that only -DARROW_CUDA=ON is truly necessary to build pyarrow.
Step 6: Building Pyarrow Wheel
With the Apache Arrow C++ bindings built, we can now build the Python wheel:
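A sketch of the wheel build; PYARROW_WITH_CUDA=1 is the switch that tells setup.py to include the CUDA extension. The flag names here follow the pyarrow 0.15 build documentation, so treat them as assumptions and check against your Arrow version:

```shell
cd python
# Tell pyarrow's setup.py to compile the CUDA extension
export PYARROW_WITH_CUDA=1
python3 setup.py build_ext --build-type=release --bundle-arrow-cpp bdist_wheel
```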
As the build runs, you'll eventually see messages in the logs confirming that the CUDA extension is being compiled, which shows that we're getting the behavior we want.
When the process finishes, the final wheel will be in the dist directory, the standard output location for setup.py bdist_wheel.
Step 7 (Optional): Validate Build
If you want to validate that your pyarrow wheel has CUDA installed, you can run the following:
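The check is just an import, sketched below; it assumes the wheel built above has been installed into your virtualenv:

```python
# If the CUDA extension was compiled into the wheel, this import succeeds
import pyarrow
from pyarrow import cuda

print(pyarrow.__version__)
```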
When the line from pyarrow import cuda runs without error, we know that our pyarrow build with CUDA was successful.
Published at DZone with permission of Randy Zwitch. See the original article here.