While working with Spark, we often come across three APIs: DataFrames, Datasets, and RDDs. In this blog, I will compare the three in terms of performance and optimization. Spark provides seamless conversion between DataFrames, Datasets, and RDDs, and under the hood, both DataFrames and Datasets are built on top of RDDs.
The three were introduced in the following order:
RDD (Spark 1.0) > DataFrame (Spark 1.3) > Dataset (Spark 1.6).
Let's begin with the Resilient Distributed Dataset (RDD).
The crux of Spark lies in the RDD. It is an immutable, distributed collection of elements partitioned across the nodes of the cluster that can be operated on in parallel with low-level APIs offering transformations and actions. RDDs are the right choice when you:
- Work with unstructured data, such as streams.
- Manipulate data with functional programming constructs.
- Want data access and processing free of schema impositions.
- Require low-level transformations and actions.
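As a minimal sketch of this low-level style (assuming a local `SparkContext` named `sc` is in scope), a word-count pipeline of transformations followed by an action might look like this:

```scala
// Hypothetical word-count sketch using the low-level RDD API.
val lines = sc.parallelize(Seq("spark makes rdds", "rdds are resilient"))

val counts = lines
  .flatMap(_.split(" "))          // transformation: split each line into words
  .map(word => (word, 1))         // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)             // transformation: sum the counts per word

counts.collect().foreach(println) // action: triggers the actual computation
```

Note how the whole pipeline is expressed with plain Scala functions rather than a schema or a query language.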
Features of RDDs
There are several salient features of RDDs.
RDDs can easily and efficiently process both structured and unstructured data, and they are available in several programming languages: Java, Scala, Python, and R.
They are based on the MapReduce model, which is widely used for processing and generating large data sets in parallel using distributed algorithms on a cluster. RDDs let us write parallel computations with the help of high-level operators, without the overhead of managing work distribution and fault tolerance ourselves.
RDDs are collections of partitioned records. A partition is the primitive unit of parallelism in an RDD: each partition is an immutable, logical division of the data, generated through transformations on existing partitions.
If a partition of an RDD is lost, the transformations that produced it can simply be replayed on the source data to recompute it, which is cheaper than replicating the data across multiple nodes.
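This recovery works because every RDD remembers its lineage. As an illustrative sketch (again assuming a `SparkContext` named `sc`), the lineage can be inspected with `toDebugString`:

```scala
// Each RDD records the chain of transformations that produced it; a lost
// partition is rebuilt by replaying this chain, not by restoring a replica.
val derived = sc.parallelize(1 to 100)
  .map(_ * 2)
  .filter(_ % 3 == 0)

println(derived.toDebugString) // prints the lineage of parent RDDs
```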
All transformations are lazy: they don't compute their results right away. Spark merely records them, and they are executed only when an action requires a result to be returned to the driver program.
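A small sketch of this laziness (assuming a `SparkContext` named `sc`): the `map` below does no work until the `reduce` action forces it.

```scala
val numbers = sc.parallelize(1 to 10)

// Nothing is computed here: map only records the transformation.
val squares = numbers.map(n => n * n)

// The action forces evaluation of the whole lineage.
val total = squares.reduce(_ + _)
```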
Drawbacks of RDDs
There's no built-in optimization engine. When working with structured data, RDDs cannot take advantage of Spark's advanced optimizers (the Catalyst optimizer and the Tungsten execution engine); developers need to optimize each RDD based on its characteristics.
Also, unlike DataFrames and Datasets, RDDs don't infer the schema of the ingested data; the user has to specify it explicitly.
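For example, turning an RDD of rows into a DataFrame requires spelling out the schema by hand. A minimal sketch, assuming a `SparkContext` `sc` and `SparkSession` `spark` (the field names here are made up for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// An RDD of Rows carries no schema of its own...
val rows = sc.parallelize(Seq(Row("Alice", 34), Row("Bob", 45)))

// ...so the structure must be described explicitly to obtain a DataFrame.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)
))

val df = spark.createDataFrame(rows, schema)
```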
DataFrames

DataFrames are immutable distributed collections of data in which the data is organized in a relational manner: named columns, drawing a parallel to tables in a relational database. The essence of DataFrames is to superimpose a structure on the distributed collection of data in order to allow efficient and easier processing. A DataFrame is conceptually equivalent to a table in a relational database. With DataFrames, Spark also uses the Catalyst optimizer.
Following are the salient features of DataFrames.
They are conceptually equivalent to a table in a relational database, but with richer optimizations under the hood.
They can process various data formats (e.g. Avro, CSV, ElasticSearch, and Cassandra) and storage systems (e.g. HDFS, Hive tables, and MySQL).
They support both SQL queries and the DataFrame API.
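To sketch how the two styles coexist (assuming a `SparkSession` named `spark`; the table and column names are illustrative):

```scala
import spark.implicits._

val people = spark
  .createDataFrame(Seq(("Alice", 34), ("Bob", 45)))
  .toDF("name", "age")

// The DataFrame API and SQL are interchangeable views of the same data.
people.filter($"age" > 40).show()

people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()
```

Both queries compile down to the same Catalyst-optimized plan.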
Drawbacks of DataFrames
The DataFrame API does not support compile-time type safety, which limits the user when manipulating data whose structure is not known.
Also, once a domain object has been transformed into a DataFrame, the original domain object cannot be regenerated from it.
Datasets

The Dataset API has two distinct characteristics: strongly typed and untyped. A DataFrame can be seen as a Dataset of the generic type Row — that is, Dataset[Row] — where a Row is a generic, untyped JVM object.
Unlike DataFrames, Datasets are by default collections of strongly typed JVM objects: in Java, they are mapped by a class; in Scala, they are governed by a case class.
Datasets provide compile-time type safety, allowing us to catch errors at compile time that DataFrames would only surface at runtime. Another advantage is that DataFrames render a structured view over semi-structured data as a Dataset[Row].
At the core of the API is an encoder responsible for the conversion between JVM objects and Spark's tabular representation. This representation is stored in the Tungsten binary format, improving memory utilization.
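A minimal sketch of the typed API (assuming a `SparkSession` named `spark`; `Person` is a made-up domain class):

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._ // brings encoders for common Scala types into scope

case class Person(name: String, age: Int)

// A strongly typed Dataset: field names and types are checked at compile time.
val people: Dataset[Person] = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

// A typo such as p.agee would fail compilation, not blow up at runtime.
val adults = people.filter(p => p.age > 40)

// Every Dataset can also be viewed as an untyped DataFrame, i.e. Dataset[Row].
val asRows: DataFrame = people.toDF()
```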
Datasets thus offer the best of both worlds: the type safety of RDDs together with the optimizations of DataFrames.
Drawbacks of Datasets
Datasets require type casting into Strings. Querying currently requires specifying the field as a String, and the retrieved column must later be cast to the required data type.
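A short sketch of this drawback (assuming a `SparkSession` named `spark`; `Person` is illustrative):

```scala
import spark.implicits._

case class Person(name: String, age: Int)
val people = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()

// The column is named as a plain String, so a typo ("nme") would only
// surface at runtime, and the result must be cast back to a typed column.
val names = people.select($"name".as[String])
```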
Let's now compare type safety across the three ways of querying:

|                 | SQL     | DataFrames   | Datasets     |
|-----------------|---------|--------------|--------------|
| Syntax errors   | Runtime | Compile time | Compile time |
| Analysis errors | Runtime | Runtime      | Compile time |
Now, let's discuss this in terms of performance and optimization.
The DataFrame and Dataset APIs use Catalyst to generate optimized logical and physical query plans, whether you work in Java, Scala, or Python.
Also, the Dataset[T] API is optimized for engineering tasks, while the DataFrame API is faster and more suitable for interactive analysis.
The encoders in the Dataset API serialize and deserialize JVM objects efficiently, generating compact bytecode; smaller bytecode means faster execution.
Having discussed all the important aspects of the Spark APIs, the blog would be incomplete without discussing when to use each of them.
When to use DataFrames or Datasets:
- Rich semantics.
- High-level abstractions.
- Domain-specific APIs.
- Processing with high-level operations (e.g. filters, maps).
- Columnar access and lambda functions on semi-structured data.
When to use Datasets:
- A high degree of type safety at compile time.
- Take advantage of typed JVM objects.
- Take advantage of the Catalyst optimizer.
- Save space.
- Faster execution.
When to use RDDs:
- Low-level functionality.
- Tight control.