Understanding Spark SQL, DataFrames, and Datasets

Let's take a look at understanding Spark SQL, DataFrames, and Datasets and explore how to create DataFrames from RDDs.

By Teena Vashist · Sep. 14, 2018


There is often confusion about the Dataset and DataFrame APIs, so in this article, we will learn about Spark SQL, DataFrames, and Datasets.

Spark SQL

Spark SQL is a Spark module for structured data processing. It allows you to write less code to get things done and, under the covers, it intelligently performs optimizations. The Spark SQL module consists of two main parts; in this article, we will only discuss the first part: the structured APIs, called DataFrames and Datasets, which define the high-level APIs for working with structured data.

One of the cool features of the Spark SQL module is the ability to execute SQL queries for data processing; the result of a query is returned as a Dataset or DataFrame. The Spark SQL module also makes it easy to read and write data in a variety of formats: text-based formats such as CSV, XML, and JSON, and binary formats such as Avro, Parquet, and ORC.
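
As a minimal sketch (assuming a SparkSession named spark and a hypothetical CSV file people.csv with name and age columns), a DataFrame can be registered as a temporary view, queried with SQL, and written back out in a binary format:

// Read a CSV file into a DataFrame (the path and columns are hypothetical)
val people = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")

// Register the DataFrame as a temporary view so it can be queried with SQL
people.createOrReplaceTempView("people")

// The result of the SQL query comes back as a DataFrame
val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

// Write the result out in a binary format such as Parquet
adults.write.parquet("adults.parquet")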

DataFrames

A DataFrame is a distributed collection of data organized into rows, where each row consists of a set of columns, and each column has a name and an associated type. In other words, this distributed collection of data has a structure defined by a schema. You can think of it as a table in a relational database, but under the hood, it has much richer optimizations.

Like the RDD, the DataFrame offers two types of operations: transformations and actions.

Transformations are lazily evaluated, and actions are eagerly evaluated.
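
As a small sketch (assuming a SparkSession named spark and the spark.implicits._ import, both available by default in spark-shell), the filter below only builds up a query plan; nothing is computed until show() is called:

import spark.implicits._ // provides the $"column" syntax (imported by default in spark-shell)

// Transformation: lazily recorded, no job runs yet
val nums = spark.range(1, 11).toDF("n")
val evens = nums.filter($"n" % 2 === 0)

// Action: triggers the actual computation and returns results
evens.show()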

Creating DataFrames

There are several ways to create a DataFrame; one common thing among them is the need to provide a schema, either implicitly or explicitly.

The following code examples work with Spark 2.x and Scala 2.11.

Creating DataFrames from RDDs

val rdd = sc.parallelize(1 to 10).map(x => (x, x * x))
val dataframe = spark.createDataFrame(rdd).toDF("key", "square")
dataframe.show()

//Output:

+---+------+
|key|square|
+---+------+
|  1|     1|
|  2|     4|
|  3|     9|
|  4|    16|
|  5|    25|
|  6|    36|
|  7|    49|
|  8|    64|
|  9|    81|
| 10|   100|
+---+------+
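
In the example above, the schema is inferred implicitly from the tuple's types. A schema can also be supplied explicitly; here is a minimal sketch that builds the same DataFrame from an RDD of Rows and an explicit StructType:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Describe the structure explicitly instead of letting Spark infer it
val rowRDD = sc.parallelize(1 to 10).map(x => Row(x, x * x))
val schema = StructType(Seq(
  StructField("key", IntegerType, nullable = false),
  StructField("square", IntegerType, nullable = false)))

// createDataFrame accepts an RDD of Rows together with a schema
val dataframeWithSchema = spark.createDataFrame(rowRDD, schema)
dataframeWithSchema.printSchema()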

Datasets

A Dataset is a strongly typed, immutable collection of data. Similar to a DataFrame, the data in a Dataset is mapped to a defined schema; the difference is that a Dataset adds type safety and an object-oriented interface.

There are a few important differences between a DataFrame and a Dataset.

  • Each row in a Dataset is represented by a user-defined object so that you can refer to an individual column as a member variable of that object. This provides you with compile-time safety.
  • A Dataset has helpers called encoders, which are smart and efficient encoding utilities that convert data inside each user-defined object into a compact binary format. This translates into a reduction of memory usage if and when a Dataset is cached in memory as well as a reduction in the number of bytes that Spark needs to transfer over a network during the shuffling process.

Creating Datasets

There are a few ways to create a Dataset:

  • The first way is to transform a DataFrame to a Dataset using the as[T] method of the DataFrame class.
  • The second way is to use the SparkSession.createDataset() function to create a Dataset from a local collection of objects.
  • The third way is to use the toDS implicit conversion utility.

Let's look at different ways of creating Datasets:

// assumes import spark.implicits._ (imported by default in spark-shell)

// The case class gives the Dataset a schema; Spark derives an encoder for it
case class Movie(actor_name: String, movie_title: String, produced_year: Long)

// Create a Dataset using SparkSession.createDataset() and the toDS() implicit
val localMovies = Seq(Movie("DDLJ", "Awesome", 2018L), Movie("ADHM", "Nice", 2018L))
val moviesDS = spark.createDataset(localMovies)
moviesDS.show()
val moviesDS1 = localMovies.toDS()
moviesDS1.show()

// Encoders are created automatically for case classes
case class Employee(name: String, age: Long)
val caseClassDS = Seq(Employee("Amy", 32)).toDS()
caseClassDS.show()

// Convert a DataFrame to a strongly typed Dataset with as[Movie]
val movies = Seq(("Damon, Matt", "The Bourne Ultimatum", 2007L),
    ("Damon, Matt", "Good Will Hunting", 1997L))
val moviesDF = movies.toDF("actor_name", "movie_title", "produced_year").as[Movie]
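
Because moviesDF above is a Dataset[Movie], each row is a Movie object, so columns can be accessed as member variables, and a misspelled field name fails at compile time rather than at runtime. A small illustrative sketch (again assuming the spark.implicits._ import):

// Fields are accessed as ordinary members of the Movie object
moviesDF.filter(movie => movie.produced_year > 2000).show()
moviesDF.map(movie => movie.movie_title.toUpperCase).show()

// A typo such as movie.produced_yr would not compile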

Thank you for reading this article. I hope it was helpful to you.
