Using DataFrames for Analytics in the Spark Environment
Using DataFrames for Analytics in the Spark Environment
A DataFrame is a clear way to maintain organized, structured data. Spark SQL on DataFrames lets you interact directly with your data with the powerful Spark engine.
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Making the jump from static data to some kind of actionable model is often the real trick of any analytic technique. As data gets bigger and bigger, just handling the “small” stuff can become its own class of problem. Introducing intelligent organizational structure as early as possible can help frame how problems should be approached — as well as moving the small problems to the side where a developer or analyst can focus on the substance of the real problems they want to solve. This article will look at the organizational structure of DataFrames and how they can be leveraged within the Apache Spark environment to fully take advantage of the power of the Spark engine and ecosystem while staying organized within familiar frames of reference.
DataFrames in General
A DataFrame is analogous to a table in a relational database or spreadsheet. It has rows and columns, and data in the cells at the intersection of each row/column. A row is a single set of data that all conceptually goes together. A column represents a certain aspect or attribute of each row and is often composed of data of the same data type across each row. Resultantly, any given portion of data can be coordinated via the row and column in which it lives.
A DataFrame works the same way. While data is often thought of as a collection of many individual items or objects, a DataFrame is a single object of organization around multiple pieces of data. Because of this, it can be worked with as a cohesive and singular object. One DataFrame can be filtered into a subset of its rows. It can create a new column based on operations on existing columns. Multiple DataFrames can be combined in different ways to make new DataFrames.
DataFrames have long been a staple in other analytical frameworks like R and Python (via Pandas). In those environments, organizing data into a DataFrame opens up a whole host of direct (reference indexing, native methods) and indirect (SQL, machine learning integration) functionalities on data to be used in an analytical pipeline. Spark is no different; DataFrames are now a first-class citizen of the Apache Spark engine.
DataFrames in Spark
At the heart of Spark is a dataset. All Spark operations take place against a distributed dataset. A DataFrame in Spark is just an additional layer of organization overlaid on a dataset. In a DataFrame, the dataset contains named columns. Just like a table in a relational database or a spreadsheet, a given column can now be indexed and referenced by its name — either directly in programming a Spark job or indirectly when using the data in another component of the Spark ecosystem like a Machine Learning function or SQL.
Getting Data Into a DataFrame
If data is already stored and is to be loaded into a Spark session from an existing relational style storage (such as Hive, Impala, or MySQL tables), then a DataFrame structure can be inferred that will clone the data’s existing structure. If not, then the schema of the DataFrame can be specifically mapped to an existing dataset by providing the column name and datatype as metadata for each attribute of the underlying data.
Spark has compatibility for other common data formats that can be logically mapped to a DataFrame easily. Common flat formats like JSON and CSV can be read directly. More complex formats that preserve metadata like Avro and Parquet can also be read into a Spark DataFrame without having to re-specify the organizational structure of the underlying data.
Spark SQL and DataFrames
The biggest instance of similarity between a Spark DataFrameand a table in a relational table is the ability to write SQLdirectly against the data. Once a dataset is organized into a DataFrame, Spark SQL allows a user to write SQL that can be executed by the Spark engine against that data. The organizational structure that the DataFrame creates allows the same powerful execution engine to be used and to take advantage of built-in processing optimizations. To the user, the data can be interacted with as though in a database, but with the full parallelizable capabilities of the Spark environment.
DataFrame Methods and Pipeline Changing
One of the most powerful capabilities of traditional DataFrames is creating pipelines of chained operations on data that are arbitrarily complex. Unlike SQL, which is compiled into a single set of operations, chained pipelines of operations make data operations more intuitive and natural to conceive without having to nest multiple subqueries over and over.
A normal complex SQL statement normally has up to one of each major operation type: a filter, a group by, a windowed function, an aggregation, and some joins. However, it can take some mental gymnastics to think of a problem in this way, and may take multiple SQL statements and sub-queries to accomplish. A DataFrame has a number of native methods accessible. And since the output of many native methods on a DataFrame is another DataFrame — they can be chained together into arbitrarily complex pipelines.
Imagine a processing pipeline of functions that started with a filter, then aggregated those results, then joined in another data set based on that aggregation, then reordered the results of that and carried out a window function…and so on. The capabilities are practically endless and the logic of the analysis is free to follow the thought process of the analyst. Yet still, under the hood, the Spark engine will process and optimize the underlying operations necessary to accomplish this arbitrarily complex series. This combination becomes an incredibly powerful tool in the box of an analyst to bend data to their will.
Under the Hood
While a normal Spark RDD dataset is a collection of individual data elements, each of which may be constructed by one or many objects, a DataFrame is, at its core, a single object in memory reference space. It serves both as a cohesive object as well as a container object for the elements that make it up. This means that full parallelization is still a first-class feature. Most operations can still be processed across an arbitrarily large number of nodes in a cluster, meaning the size of a DataFrame has no practical limit. However, a user still gets the good aspects of treating a DataFrame like a singular object, similar to using R or Python.
Lower-level than the DataFrame as a whole is the Row object that makes up each cohesive component of a DataFrame. Like a row in a relational database, the Row object in a Spark DataFrame keeps the column attributes and data available as a cohesive single object for given “observation” of data. Rows have their own methods and magic methods, as well; for example, a Row can convert itself into a hashmap/dictionary and can evaluate whether it contains a given column or not.
Differences From Traditional DataFrames
This is the root of the biggest difference between a DataFrame in R/Python Pandas and a Spark DataFrame from a technical perspective. A traditional DataFrame is a columnar structure.For a DataFrame having 10 rows of data with 3 columns, under the hood, the frame is stored as 3 arrays (vectors) of length 10 each (along with some meta-data). A given “row” in the DataFrame is achieved by indexing into each column array to the same position. This is achievable since the entire DataFrame is held in memory while processing.
That is not the case in a Spark DataFrame. A Spark DataFrame needs to be able to process its operations in parallel, often on different servers that cannot communicate their state to one another during the processing stages. Thus, the data is stored in a relational format rather than a columnar one; where the natural division of objects is along a given row rather than a cohesive column. Because it is common to have far more rows than columns as data gets very large, this organizational structure means that operations are parallelized from the row level instead of the column level. To the user, almost all of this is handled in the engine layer where interactions are not substantially different.
Apache Spark is highly effective not because it reinvents the wheel but because it amplifies existing analytical methods to be as highly effective across massive data sets, as they are to comparably small ones you can process on a single machine. Spark DataFrames are revolutionary precisely because they are, at heart, not a whole new thing but rather a bridge from the tried and true analytical methods of the past to the scalable nature necessary for the future. Everything old is new again!
Opinions expressed by DZone contributors are their own.