
The Dominant APIs of Spark: Datasets, DataFrames, and RDDs


Learn about the use cases, features, and drawbacks for DataFrames, Datasets, and RDDs in Spark, and see what they have to do with performance and optimization.


While working with Spark, we often come across three APIs: DataFrames, Datasets, and RDDs. In this blog, I will discuss the three in terms of performance and optimization. Conversion between DataFrames, Datasets, and RDDs is seamless, and under the hood the RDD forms the foundation of both DataFrames and Datasets.

The three were introduced in the following order:

RDD (Spark 1.0) > DataFrame (Spark 1.3) > Dataset (Spark 1.6)

Let's begin with the Resilient Distributed Dataset (RDD).

RDD

The crux of Spark lies in the RDD. It is an immutable, distributed collection of elements partitioned across the nodes of the cluster, which can be operated on in parallel with low-level APIs offering transformations and actions.

Use Cases

  • You are working with unstructured data, such as text or media streams.

  • Your data manipulation involves functional programming constructs.

  • Data access and processing are free of schema impositions.

  • You require low-level transformations and actions (a short sketch follows this list).
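As a quick illustration of that low-level style, here is a minimal word-count sketch using the RDD API. The app name, local master, and input path are hypothetical placeholders; any existing SparkContext would do.

import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup for the sketch; reuse an existing SparkContext if you have one.
val conf = new SparkConf().setAppName("rdd-wordcount").setMaster("local[*]")
val sc = new SparkContext(conf)

// Low-level transformations over unstructured text (hypothetical input path).
val lines = sc.textFile("hdfs:///data/logs.txt")
val counts = lines
  .flatMap(_.split("\\s+"))  // transformation: split each line into words
  .map(word => (word, 1))    // transformation: pair each word with a count of 1
  .reduceByKey(_ + _)        // transformation: sum the counts per word

counts.take(10).foreach(println)  // action: triggers the actual computation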

Features of RDDs

There are several salient features of RDDs.

Versatile

RDDs can easily and efficiently process both structured and unstructured data, and the RDD API is available in several programming languages, including Java, Scala, Python, and R.

Distributed Collection

RDDs are based on MapReduce-style operations, widely popular for processing and generating large data sets in parallel using distributed algorithms on a cluster. They let us write parallel computations with high-level operators without having to manage work distribution and fault tolerance ourselves.

Immutable

RDDs are a collection of records that are partitioned. A partition is a primitive unit of parallel programming in an RDD, and every partition forms a logical division of data that is immutable and generated with transformations on existing partitions.

Fault Tolerant

If an RDD partition is lost, Spark can recompute it by replaying the transformations that produced it, achieving the same result without replicating the data across multiple nodes.

Lazy Evaluations

All transformations are lazy: they don't compute their results right away. They are only executed when an action requires a result to be returned to the driver program.
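A minimal sketch of this behavior, assuming an existing SparkContext named sc:

// Nothing is computed here: map and filter only record the lineage.
val numbers  = sc.parallelize(1 to 1000000)
val doubled  = numbers.map(_ * 2)
val filtered = doubled.filter(_ % 3 == 0)

// The action below is what actually triggers execution of the whole chain.
println(filtered.count())

// The recorded lineage can be inspected at any time.
println(filtered.toDebugString)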

Drawbacks for RDDs

There's no built-in optimization engine. When working with structured data, RDDs do not take advantage of Spark's advanced optimizers (the Catalyst optimizer and the Tungsten execution engine), so developers need to optimize each RDD based on its characteristics.

Also, unlike DataFrames and Datasets, RDDs don’t infer the schema of the data ingested — the user is required to specify it explicitly.
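For example, to treat RDD records as structured rows, the schema has to be declared by hand. A minimal sketch, assuming an existing SparkSession named spark:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rowRdd = spark.sparkContext.parallelize(Seq(Row("Alice", 29), Row("Bob", 31)))

// The schema is not inferred from the data; it must be specified explicitly.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age",  IntegerType, nullable = false)
))

val df = spark.createDataFrame(rowRdd, schema)
df.printSchema()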

DataFrames

DataFrames are immutable distributed collections of data in which the data is organized in a relational manner, that is, into named columns, drawing a parallel to tables in a relational database. The essence of DataFrames is to superimpose a structure on the distributed collection of data in order to allow efficient and easier processing. A DataFrame is conceptually equivalent to a table in a relational database. With DataFrames, Spark also uses the Catalyst optimizer.

Features

Following are the salient features of DataFrames.

  • They are conceptually equivalent to a table in a relational database, but with richer optimizations.

  • They can process structured and semi-structured data from a variety of formats and sources (e.g. Avro, CSV, Elasticsearch, and Cassandra) and storage systems (e.g. HDFS, Hive tables, and MySQL).

  • They support both SQL queries and the DataFrame API (a short sketch follows this list).
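A brief sketch of both styles, assuming an existing SparkSession named spark and a hypothetical people.json file:

// Schema is inferred from the JSON data (hypothetical path).
val people = spark.read.json("people.json")

// DataFrame API
people.filter(people("age") > 21)
      .groupBy("city")
      .count()
      .show()

// Equivalent SQL query
people.createOrReplaceTempView("people")
spark.sql("SELECT city, COUNT(*) AS total FROM people WHERE age > 21 GROUP BY city").show()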

Drawbacks of DataFrames

The DataFrame API does not offer compile-time type safety, which limits the user when manipulating data whose structure is not known.

Also, once a domain object has been transformed into a DataFrame, the original typed object cannot be regenerated from it.
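To illustrate the first point: a column-name typo in the DataFrame API compiles without complaint and only fails when the query is analyzed at runtime. A small sketch with hypothetical data, assuming an existing SparkSession named spark:

import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")

// This compiles, but throws an AnalysisException at runtime because the
// column is actually called "name".
df.select("nmae").show()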

Datasets

Datasets combine two distinct API characteristics: strongly typed and untyped. A DataFrame can be seen as a Dataset of the generic type Dataset[Row], where Row is a generic, untyped JVM object.

Also, unlike DataFrames, Datasets are by default a collection of strongly typed JVM objects. In Scala, the record type is defined by a case class; in Java, by a bean class.

Datasets provide static typing and runtime type safety, allowing us to catch errors at compile time. Another advantage is that DataFrames render a structured view over semi-structured data as Dataset[Row].

At the core of the API is an encoder responsible for the conversion between JVM objects and tabular representation. This representation is stored in the Tungsten Binary Format, improving memory utilization.
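A minimal sketch of the typed API, assuming an existing SparkSession named spark:

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._   // brings the implicit encoders into scope

case class Person(name: String, age: Long)

// Strongly typed Dataset: the compiler knows each record is a Person.
val people: Dataset[Person] = Seq(Person("Alice", 29L), Person("Bob", 31L)).toDS()

// Typed, compile-time-checked transformations using lambdas.
val adultNames: Dataset[String] = people.filter(p => p.age >= 18).map(p => p.name)

// Untyped view of the same data: a DataFrame is just a Dataset[Row].
val asDataFrame: DataFrame = people.toDF()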

Features

Datasets offer the best of both worlds, including:

  • Functional programming.

  • Type safety.

  • Query optimization.

  • Encoding.

Drawbacks of Datasets

Datasets require typecasting into Strings: querying currently requires specifying the fields of the class as Strings, and once the data has been queried, the column must be cast to the required data type.
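For example, pulling a single field out of a typed Dataset still goes through a String column name plus an explicit cast. A sketch with the same hypothetical Person type as above, assuming an existing SparkSession named spark:

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col
import spark.implicits._

case class Person(name: String, age: Long)
val people: Dataset[Person] = Seq(Person("Alice", 29L), Person("Bob", 31L)).toDS()

// The field is referenced by its String name and then cast back to a typed column.
val names: Dataset[String] = people.select(col("name").as[String])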

Let's now compare type safety across SQL, DataFrames, and Datasets:

Error            SQL        DataFrames      Datasets
Syntax error     Runtime    Compile time    Compile time
Analysis error   Runtime    Runtime         Compile time

Now, let's discuss this in terms of performance and optimization.

Performance Optimization

The DataFrame and Dataset APIs use Catalyst to generate optimized logical and physical query plans, whether you work in Java, Scala, or Python.

Also, the typed Dataset[T] API is optimized for engineering tasks, while the untyped DataFrame API is faster and better suited for interactive analysis.
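The plans Catalyst produces can be inspected directly with explain(). A small sketch with hypothetical data, assuming an existing SparkSession named spark:

import spark.implicits._

val df = Seq(("Alice", 29), ("Bob", 31)).toDF("name", "age")

// Prints the parsed, analyzed, and optimized logical plans plus the physical
// plan generated for this query.
df.filter($"age" > 21).groupBy($"name").count().explain(true)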

Space Optimization
[Figure: memory usage when caching Datasets vs. RDDs]

The encoders in the Dataset API efficiently serialize and deserialize JVM objects and generate compact bytecode; smaller bytecode means faster execution.
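A rough way to see the memory difference for yourself is to cache the same data both ways and compare the sizes reported on the Storage tab of the Spark UI. A sketch with hypothetical data, assuming an existing SparkSession named spark:

import spark.implicits._

case class Event(id: Long, name: String)
val data = (1L to 1000000L).map(i => Event(i, s"event-$i"))

// Cached as a Dataset: rows are stored in Tungsten's compact binary format.
val ds = data.toDS()
ds.cache().count()

// Cached as an RDD: rows are kept as regular JVM objects on the heap,
// which typically takes noticeably more memory.
val rdd = spark.sparkContext.parallelize(data)
rdd.cache().count()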

Having discussed the important aspects of the three Spark APIs, this blog would be incomplete without discussing when to use each of them.

When to use DataFrames or Datasets:

  • Rich semantics.
  • High-level abstractions.
  • Domain-specific APIs.
  • Processing of high-level operations (e.g. filters and maps).
  • Use columnar access and lambda functions on semi-structured data.

When to use Datasets:

  • A high degree of type safety at compile time.
  • Take advantage of typed JVM objects.
  • Take advantage of the Catalyst optimizer.
  • Save space.
  • Faster execution.

When to use RDDs:

  • Low-level functionality.
  • Tight control.



Published at DZone with permission of Pallavi Singh, DZone MVB. See the original article here.

