
Why Dataset Over DataFrame?


If you have a choice, you should pick the Dataset API in Spark 2.0 over the DataFrame API. Wait... what?! Read on to find out why.


In this blog, we will learn the advantages that the Dataset API in Spark 2.0 has over the DataFrame API.

DataFrame is weakly typed, so developers don't get the benefits of the type system. That's why the Dataset API was introduced in Spark 2.0. To understand this, consider the following scenario.

Suppose that you want to read the contents of a CSV file, emp.csv, in a structured way.
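The original post doesn't show the file itself, but given the schema Spark infers below (ID, NAME, ADDRESS), emp.csv would contain something like this (hypothetical sample data):

ID,NAME,ADDRESS
1,Alice,12 High Street
2,Bob,34 Park Lane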

scala> val dataframe = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv")
dataframe: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

scala> dataframe.select("name").where("ids>1").collect
org.apache.spark.sql.AnalysisException: cannot resolve '`ids`' given input columns: [name]; line 1 pos 0;
'Filter ('ids > 1)
+- Project [name#1]
   +- Relation[ID#0,NAME#1,ADDRESS#2] csv

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

So instead of giving you a compilation error, it gives you a runtime error. If you use the Dataset API, the same mistake gets caught at compile time instead.
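Before calling as[Emp], an Emp case class has to be in scope. The original post doesn't show its definition, but based on the ID, NAME, and ADDRESS columns in the plan above, it would look something like this:

scala> case class Emp(id: Int, name: String, address: String)
defined class Emp

Now read the file as a typed Dataset: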

scala> val dataset = spark.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("file:///home/hduser/Documents/emp.csv").as[Emp]

dataset: org.apache.spark.sql.Dataset[Emp] = [ID: int, NAME: string ... 1 more field]

The Dataset is typed because it operates on domain objects. We get type safety here because the Dataset's element type is the Emp class, so the compiler knows exactly which fields exist.
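For example, typed transformations compile because every field access is checked against Emp (a minimal sketch; the res number and printed schema are illustrative):

scala> dataset.filter(_.id > 0).map(_.name)
res0: org.apache.spark.sql.Dataset[String] = [value: string]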

And if we try to map over a field that doesn't exist on Emp, it will give a compilation error:

scala> dataset.filter("id>0").map(_.name1)
<console>:28: error: value name1 is not a member of Emp
       dataset.filter("id>0").map(_.name1)

So we can say that a Dataset is a DataFrame with type safety: in Spark 2.0, DataFrame is simply an alias for Dataset[Row], while a typed Dataset can operate on domain objects, unlike DataFrames.
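In fact, the two views are interchangeable via the standard toDF() and as[T] methods (a minimal sketch; the printed schemas mirror the ones shown earlier):

scala> val df = dataset.toDF()
df: org.apache.spark.sql.DataFrame = [ID: int, NAME: string ... 1 more field]

scala> val ds = df.as[Emp]
ds: org.apache.spark.sql.Dataset[Emp] = [ID: int, NAME: string ... 1 more field]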

