The Plight of the Data Consumer
Apache Arrow has become the core building block for heterogeneous data infrastructure and tools.
Organizations today find many of their core processes powered by data. Individuals in their respective roles rely on data as part of their daily workflows. These data consumers include analysts and data scientists who are increasingly technology-savvy. They use tools like Python, R, and TensorFlow, as well as BI tools like Tableau, Qlik, and Power BI.
Data consumers rely on having data that is easy to access and fast to use. The challenge is that data is spread across many systems and stored in many formats. While it was once possible and practical for organizations to move their data into one system via ETL tools like Informatica, this is no longer practical today. Modern data architectures, including microservices, NoSQL databases, and cloud services, generate data that is fundamentally incompatible with existing analytical infrastructure. As a result, data consumers are entirely dependent on IT to get access to the data they need.
As an industry, we need to develop technologies that will help us cope with this new reality: technologies that allow systems to efficiently interoperate and allow business analysts and data scientists to use data, despite the fact that it exists in many different systems.
Apache Arrow to the Rescue
Apache Arrow was created by the developers of Dremio, Pandas, and other open-source projects to provide the core data building block for this era of heterogeneous data infrastructure and tools. It’s only been 18 months since the first prototype of Apache Arrow, and already it seems clear that Arrow is becoming a de-facto standard, with over 100,000 downloads a month and adoption across a diverse range of systems:
Data science and machine learning (Pandas, H2O, RISELab's Ray, Spark)
GPU databases (NVIDIA GPU DataFrame, MapD)
Browser visualization (Graphistry)
Self-service data (Dremio)
The major long-term benefit of Arrow is that it enables different systems to talk to each other with almost no overhead because they all use the same memory representation. This is essentially the same as having a common currency or language that is used by everyone. Instead of converting between currencies or translating between languages in different countries, a common standard eliminates friction points, improves accuracy, and allows for efficient exchange of money and ideas.
When we conceived of Arrow back in 2016, we realized that interoperability alone would not appeal to any individual project until other projects were already using it. This is a classic chicken-and-egg problem. The solution was to provide not only a specification but also plug-and-play functionality, including a rich metadata/type system and processing libraries in multiple languages, so that Arrow would be beneficial to any single project adopting it. For example, the Pandas community has written about how Pandas is being rearchitected on top of Arrow.
Apache Arrow in Practice
With the widespread adoption of Arrow now well underway, let's dive into the details of how Arrow plays a role in data science across data sources. Dremio is an open-source project built on Apache Arrow, which connects to a variety of data sources and provides the ability to execute queries across data in these sources. For column-oriented data sources, like Parquet files in a data lake, Dremio provides native connectors that translate data in a columnar fashion directly into Arrow buffers. For row-oriented data sources like MongoDB or SQL Server, Dremio provides connectors that perform the translation into Arrow record batches, where each batch represents a sequence of rows but is stored column by column in the memory representation of the batch.
Internally, all data in Dremio is represented as Arrow buffers. All processing operators consume and produce Arrow buffers, and all exchanges over the network are streams of Arrow buffers. In addition, data can be cached in memory as Arrow buffers.
The results of a query are also produced as Arrow buffers. These buffers are streamed over the network to the consumer, often a data science or BI tool. Clients such as Pandas (for example, in Jupyter Notebooks) can work with Arrow buffers directly, eliminating the need for serialization and deserialization. Unfortunately, not all client applications currently support Arrow natively. For example, many BI clients such as Tableau rely on the legacy ODBC API. For these clients, Dremio includes an ODBC driver that translates record batches from Arrow to ODBC so that the BI tool can consume them.
Today’s data scientists and analysts live in a reality where data is bigger than ever — and also more distributed than ever. Arrow enables faster processing by leveraging the capabilities of modern CPUs and GPUs, and it enables cross-data-source analysis through a de facto industry-standard memory representation.