
Why Dataset Over DataFrame?


If you have a choice, you should pick the Dataset API in Spark 2.0 over the DataFrame API. Wait... what?! Read on to find out why.


In this post, we will look at the advantages that the Dataset API in Spark 2.0 has over the DataFrame API.

The DataFrame API is weakly typed, so developers don't get the benefits of the type system. That's why the Dataset API was introduced in Spark 2.0. To understand this, consider the following scenario.

Suppose that you want to read data from a CSV file in a structured way:

scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv")
dataframe: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

scala> dataframe.select("name").where("ids>1").collect
org.apache.spark.sql.AnalysisException: cannot resolve '`ids`' given input columns: [name]; line 1 pos 0;
'Filter ('ids > 1)
+- Project [name#1]
   +- Relation[ID#0,NAME#1,ADDRESS#2] csv

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

So instead of a compilation error, the DataFrame API gives you a runtime error. With the Dataset API, the same mistake is caught at compile time.
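To use as[Emp], we first need a case class matching the schema. A minimal sketch, assuming field names and types from the CSV columns ID, NAME, and ADDRESS (Spark matches column names to case class fields case-insensitively by default):

// Assumed definition: fields inferred from the CSV columns ID, NAME, ADDRESS
case class Emp(id: Int, name: String, address: String)

With Emp defined, the same file can be read as a typed Dataset: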

scala> val dataset = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv").as[Emp]

dataset: org.apache.spark.sql.Dataset[Emp] = [ID: int, NAME: string ... 1 more field]

The Dataset is typed because it operates on domain objects: the return type here is Dataset[Emp], so the compiler knows the fields of each record.
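For example, typed transformations like the following compile cleanly, since id and name are fields of Emp (a minimal sketch, assuming the Emp definition above):

// The compiler verifies that id and name are fields of Emp.
val names = dataset.filter(_.id > 0).map(_.name)
names.show()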

And if we try to map over a field that doesn't exist, we get a compilation error:

scala> dataset.filter("id>0").map(_.name1)
<console>:28: error: value name1 is not a member of Emp
       dataset.filter("id>0").map(_.name1)

So we can say that a DataFrame is just an untyped alias for Dataset[Row], while a typed Dataset adds compile-time type safety because it operates on domain objects rather than generic rows.
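In fact, Spark's own source makes this relationship explicit: DataFrame is defined as a type alias in the org.apache.spark.sql package object.

// From the org.apache.spark.sql package object in Spark 2.x:
type DataFrame = Dataset[Row]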


