Building Pyarrow With CUDA Support

In this article, we discuss how to build Python's pyarrow package with CUDA support.

The other day, I was looking to read an Arrow buffer on a GPU using Python, but as far as I could tell, none of the pyarrow packages provided on conda or pip are built with CUDA support. Like many of the packages in the compiled-C-wrapped-by-Python ecosystem, Apache Arrow is thoroughly documented, but the number of ways you could choose to build pyarrow with CUDA support quickly becomes overwhelming.

In this post, I’ll show you how to build pyarrow with CUDA support on Ubuntu using Docker and virtualenv. These directions closely follow the official Apache Arrow docs, but here, I explain them step-by-step and show only the single build toolchain I used.

Step 1: Docker With GPU Support

Even though I used Ubuntu 18.04 LTS on a workstation with an NVIDIA GPU, whenever I undertake a project like this, I like to use a Docker container to keep everything isolated. The last thing you want is to change dependencies for one project, break something else, and end up debugging environment errors. Thankfully, NVIDIA Docker developer images are available via DockerHub.


In the docker run command, the -it flag puts us inside the container at a bash prompt, --gpus=all allows the Docker container to access my workstation’s GPUs, and --rm deletes the container after we’re done to save space.
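Those flags fit together roughly as follows. This is a minimal sketch, not a definitive command: the image tag is an assumption (pick a CUDA devel tag that matches your driver), and start_cuda_container is a hypothetical helper name, used so that nothing runs until you call it.

```shell
# Hypothetical helper: start an interactive CUDA development container.
# The default image tag is an assumption; pass another tag as the first argument.
start_cuda_container() {
    local image="${1:-nvidia/cuda:10.1-devel-ubuntu18.04}"
    # -it: interactive bash prompt; --gpus=all: expose the host GPUs;
    # --rm: delete the container on exit
    docker run -it --gpus=all --rm "$image" bash
}
```

Calling start_cuda_container with no arguments uses the assumed default tag. The --gpus flag needs Docker 19.03 or later with the NVIDIA container toolkit installed on the host.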

Step 2: Setting Up the Ubuntu Docker Container

Docker images pulled from DockerHub are frequently bare-bones in terms of included libraries, and their packages can usually be updated as well. For building pyarrow, it’s useful to update the system packages and install a set of common build tools and libraries via apt.


In a later step, we’ll use the Arrow third-party dependency script to ensure all needed dependencies are present, but these are a good start.
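As a rough illustration of this step, the sketch below updates the image and installs a set of build prerequisites. The package list is an assumption based on common Arrow build requirements on Ubuntu 18.04, not a verified minimum, and install_build_prereqs is a hypothetical helper name.

```shell
# Hypothetical helper: refresh the base image and install build prerequisites.
# The package list is an assumption, not the exact list used for this build.
install_build_prereqs() {
    apt update && apt upgrade -y &&
    apt install -y git wget build-essential cmake autoconf flex bison \
        libssl-dev libboost-dev libboost-filesystem-dev \
        libboost-system-dev libboost-regex-dev python3-dev python3-pip
}
```

Inside the container you are already root, so no sudo is needed. If cmake later complains about a missing library, adding the matching -dev package here is usually the fix.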

Step 3: Cloning Apache Arrow From GitHub

Cloning Arrow from GitHub is pretty straightforward. The git checkout apache-arrow-0.15.0 line is optional; I needed version 0.15.0 for the project I was exploring, but if you want to build from the master branch of Arrow, you can omit that line.
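The clone and the optional checkout can be sketched like this. The /repos/arrow destination matches the path used later in this post; clone_arrow is a hypothetical helper name, and the tag argument is optional.

```shell
# Hypothetical helper: clone Arrow and optionally pin a release tag.
clone_arrow() {
    local dest="${1:-/repos/arrow}"
    local tag="${2:-}"
    git clone https://github.com/apache/arrow.git "$dest" &&
    cd "$dest" || return 1
    # Optional: leave the tag empty to stay on the default branch
    if [ -n "$tag" ]; then
        git checkout "$tag"
    fi
}
```

For the version used here, that would be clone_arrow /repos/arrow apache-arrow-0.15.0.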


Step 4: Installing Remaining Apache Arrow Dependencies

As mentioned in Step 2, some of the dependencies for building Arrow are system-level and can be installed via apt. To ensure that we have all the remaining third-party dependencies, we can use the provided script in the Arrow repository:


The script downloads all of the necessary libraries and sets environment variables that are picked up later, which is amazingly helpful.
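A minimal sketch of calling that script, assuming the repository lives in /repos/arrow. The download directory ($HOME/arrow-thirdparty) is an assumption, and fetch_arrow_thirdparty is a hypothetical helper name.

```shell
# Hypothetical helper: download Arrow's third-party source archives.
# The default download directory is an assumption; any writable path works.
fetch_arrow_thirdparty() {
    local repo="${1:-/repos/arrow}"
    local dest="${2:-$HOME/arrow-thirdparty}"
    cd "$repo/cpp" || return 1
    ./thirdparty/download_dependencies.sh "$dest"
}
```

In the Arrow versions I am aware of, the script prints a list of export lines for the downloaded archives. If your checkout does the same, evaluating that output in the current shell makes the variables visible to cmake in the next step.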

Step 5: Building Apache Arrow C++ Library

Pyarrow links to the Arrow C++ library, so that library needs to be present before we can build the pyarrow wheel:


This is a pretty standard workflow for building a C or C++ library. We create a build directory and call cmake from inside that directory to set the options we want to use. Then, we use make and make install to compile and install the library, respectively.

I chose all of the -DARROW_* options above just as a copy/paste from the Arrow documentation; Arrow doesn’t take long to build using these options, but it may be that only -DARROW_PYTHON=ON and -DARROW_CUDA=ON are strictly necessary to build pyarrow.
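The workflow described above can be sketched as follows. This uses only the two options named as likely essential, which is itself an untested assumption; the full build used a longer list of -DARROW_* flags. Using ARROW_HOME as the install prefix (defaulting to /repos/arrow/dist) is also an assumption, as is an active Python environment with numpy installed, and build_arrow_cpp is a hypothetical helper name.

```shell
# Hypothetical helper: configure, compile, and install the Arrow C++ library.
# Assumptions: ARROW_HOME as the install prefix and a minimal flag set.
build_arrow_cpp() {
    local repo="${1:-/repos/arrow}"
    export ARROW_HOME="${ARROW_HOME:-$repo/dist}"
    export LD_LIBRARY_PATH="$ARROW_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    # Out-of-source build directory
    mkdir -p "$repo/cpp/build" && cd "$repo/cpp/build" || return 1
    cmake -DCMAKE_INSTALL_PREFIX="$ARROW_HOME" \
          -DCMAKE_INSTALL_LIBDIR=lib \
          -DARROW_PYTHON=ON \
          -DARROW_CUDA=ON \
          .. &&
    make -j"$(nproc)" &&
    make install
}
```

Installing into a prefix under the repository keeps the build out of system directories, which fits the goal of keeping everything isolated.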

Step 6: Building Pyarrow Wheel

With the Apache Arrow C++ bindings built, we can now build the Python wheel:


As cmake and make run, you’ll eventually see the following in the build logs, which shows that we’re getting the behavior we want:


When the process finishes, the final wheel will be in the /repos/arrow/python/dist directory.
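The wheel build in this step can be sketched as below. It assumes ARROW_HOME still points at the C++ install prefix and that the active Python environment already has cython, numpy, and wheel installed; build_pyarrow_wheel is a hypothetical helper name.

```shell
# Hypothetical helper: build the pyarrow wheel against the installed C++ library.
# Assumes cython, numpy, and wheel are installed in the active environment.
build_pyarrow_wheel() {
    local repo="${1:-/repos/arrow}"
    cd "$repo/python" || return 1
    # Ask setup.py to compile the CUDA extension
    export PYARROW_WITH_CUDA=1
    python3 setup.py build_ext --build-type=release --bundle-arrow-cpp bdist_wheel
}
```

The --bundle-arrow-cpp option copies the Arrow shared libraries into the wheel, so the result can be installed in another environment on a compatible system.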

Step 7 (Optional): Validate Build

If you want to validate that your pyarrow wheel was built with CUDA support, you can run the following:


When the line from pyarrow import cuda runs without error, then we know that our pyarrow build with CUDA was successful.
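A minimal sketch of that check, assuming the wheel sits in /repos/arrow/python/dist. The wheel filename is matched with a glob because the exact name depends on your Python version, and validate_pyarrow_cuda is a hypothetical helper name.

```shell
# Hypothetical helper: install the freshly built wheel and import the CUDA module.
validate_pyarrow_cuda() {
    local dist="${1:-/repos/arrow/python/dist}"
    # Work from the dist directory so that the pyarrow source tree in
    # /repos/arrow/python does not shadow the installed package on import
    cd "$dist" || return 1
    python3 -m pip install ./pyarrow-*.whl &&
    python3 -c "from pyarrow import cuda; print('pyarrow CUDA import OK')"
}
```

If the import fails with a missing module error, the wheel was most likely built without PYARROW_WITH_CUDA set.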

Published at DZone with permission of Randy Zwitch. See the original article here.
