
Why Dataset Over DataFrame?


If you have a choice, you should pick the Dataset API in Spark 2.0 over the DataFrame API. Wait... what?! Read on to find out why.


In this blog, we will learn the advantages that the Dataset API in Spark 2.0 has over the DataFrame API.

DataFrame is weakly typed, so developers don't get the benefits of the type system: column names and column types are only checked at runtime. That's why the Dataset API was introduced in Spark 2.0. To understand this, consider the following scenario.

Suppose that you want to read the result from a CSV file in a structured way:

scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv")
dataframe: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

scala> dataframe.select("name").where("ids>1").collect
org.apache.spark.sql.AnalysisException: cannot resolve '`ids`' given input columns: [name]; line 1 pos 0;
'Filter ('ids > 1)
+- Project [name#1]
   +- Relation[ID#0,NAME#1,ADDRESS#2] csv

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

So instead of a compilation error, the DataFrame gives you a runtime error. With the Dataset API, you first define a case class that matches the CSV schema and read the same file as a typed Dataset:

scala> case class Emp(id: Int, name: String, address: String)
defined class Emp

scala> val dataset = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv").as[Emp]

dataset: org.apache.spark.sql.Dataset[Emp] = [ID: int, NAME: string ... 1 more field]

The Dataset is typed because it operates on domain objects. We get type safety here because each element of the Dataset is an instance of the Emp class.

So if we try to map over a field that doesn't exist, we get a compilation error:

scala> dataset.filter("id > 0").map(_.name1)
<console>:28: error: value name1 is not a member of Emp
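The same compile-time guarantee can be illustrated with a plain Scala collection, whose filter and map operators mirror the typed Dataset API. This is a minimal sketch; the Emp fields are an assumption based on the CSV schema shown above (ID, NAME, ADDRESS):

```scala
// Domain object mirroring the assumed CSV schema.
case class Emp(id: Int, name: String, address: String)

val emps = List(Emp(1, "Alice", "Delhi"), Emp(2, "Bob", "Mumbai"))

// Field access is checked by the compiler, just as with a typed Dataset:
val names = emps.filter(_.id > 1).map(_.name)
// emps.map(_.name1)  // would not compile: value name1 is not a member of Emp

println(names)
```

Because `_.name` is resolved against the `Emp` type at compile time, a typo like `name1` is caught before the job ever runs, rather than surfacing as an AnalysisException at runtime.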

So we can say that a DataFrame is simply an alias for Dataset[Row], while a typed Dataset adds compile-time safety because it operates on domain objects, unlike DataFrames.



Published at DZone with permission of Anubhav Tarar, DZone MVB. See the original article here.

