Spark Tutorial: Validating Data in a Spark DataFrame - Part One

There's more than one way to skin a cat: four easy methods to validate data in a Spark DataFrame.

By Bipin Patwardhan · Sep. 02, 19 · Tutorial · 31.61K Views


[Image: person holding a frame overlooking a cliffside. Caption: How your DataFrame looks after this tutorial]

Recently, while developing a modular, metadata-based ingestion engine using Spark, our team got into a discussion about data validation, which is the natural next step after data ingestion.

You might be wondering, "What is so special about data validation? Is it because of Spark?" This article is partly about Spark, but more importantly, it demonstrates Spark's power and illustrates the principle that there is more than one way to achieve a given goal.

You may also like: Apache Spark: An Engine for Large-Scale Data Processing

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I use the null check (is the content of a column null?). To keep things simple, I will assume that the data to be validated has been loaded into a Spark DataFrame named "df."
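To make the examples concrete, here is a minimal sketch that builds a DataFrame named df with a nullable name column. The session settings and sample values are hypothetical, not part of the original article:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("validation-demo")
  .master("local[*]") // hypothetical local session for experimentation
  .getOrCreate()

import spark.implicits._

// Hypothetical sample data: two valid names and one null
val df = Seq(Some("alice"), Some("bob"), None).toDF("name")
```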

Method One: Filtering

One of the simplest methods of performing validation is to filter the DataFrame down to the invalid records: val newDF = df.filter(col("name").isNull).

A variant of this technique is:

val newDF = df.filter(col("name").isNull).withColumn("nameIsNull", lit(true))


This variant is overkill, primarily because all the records in newDF are those where the name column is null. Hence, adding a new column with a "true" value is totally unnecessary, as every row will have the value of this column as 'true'.
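In practice, you usually want both sides of the split, which with this approach takes two filter calls. A sketch, using the same hypothetical df:

```scala
import org.apache.spark.sql.functions.col

val invalidDF = df.filter(col("name").isNull)    // records failing the check
val validDF   = df.filter(col("name").isNotNull) // records passing the check

println(s"valid: ${validDF.count()}, invalid: ${invalidDF.count()}")
```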

Method Two: When/Otherwise

The second technique is to use the "when" and "otherwise" constructs.

val newDF = df.withColumn("nameIsNull", when(col("name").isNull ||
col("name") === "", true).otherwise(false))

This method adds a new column that indicates the result of the null (or empty string) check on the name column. Note that comparing a column to null with === always evaluates to null in Spark, and .equals("") performs JVM object equality, so isNull and === "" are the correct forms. Unlike filtering, all rows are retained; cells in the new column will contain both "true" and "false," depending on the contents of the name column.
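Because this method keeps every row, the validation result can be summarized with a simple aggregation. For example, continuing with the hypothetical df:

```scala
// Count the rows per validation outcome
newDF.groupBy("nameIsNull").count().show()
```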

Method Three: Using Expr

Another technique is to use the "expr" feature.

val newDF = df.withColumn("nameIsNull", expr("case when name is null then true else false end"))

Note that the SQL form of the null check is "name is null"; writing "name == null" would evaluate to null for every row, so the new column would always be false.
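Since the predicate is itself boolean, the case/when wrapper is not strictly needed. An equivalent sketch using selectExpr on the same hypothetical df:

```scala
// Equivalent: the SQL predicate already yields a boolean column
val newDF2 = df.selectExpr("*", "name IS NULL AS nameIsNull")
```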

Method Four: Overkill

Now, look at this technique.

var dfFinal: DataFrame = null
// One scan to pick up the rows where name is null...
val dfTrue = df.filter(col("name").isNull).withColumn("nameIsNull", lit(true))
// ...and a second scan for the rows where it is not
val dfFalse = df.filter(col("name").isNotNull).withColumn("nameIsNull", lit(false))
// This null check is redundant: filter always returns a DataFrame, never null
if (dfTrue != null) {
  dfFinal = dfTrue.union(dfFalse)
} else {
  dfFinal = dfFalse
}


While valid, this technique is clearly overkill. Not only is it more elaborate than the previous methods, but it also does double the work: it scans the DataFrame twice, once to evaluate the "true" condition and once more to evaluate the "false" condition.
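For comparison, the same result can be produced in a single pass, because isNull already yields a boolean column:

```scala
import org.apache.spark.sql.functions.col

// One scan, no union: the null check itself is the flag column
val dfFinal = df.withColumn("nameIsNull", col("name").isNull)
```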

In this article, I have covered a few techniques for the simple task of checking whether a Spark DataFrame column contains null. These techniques not only illustrate the flexibility of Spark, but also prove that we can reach the same end goal in multiple ways.

Obviously, there are some trade-offs, but sometimes we may want to choose a method that is simple to understand and maintain, rather than use a technique just because the API provides it. In the next article, I will cover how something similar can be achieved using UDFs.


