Spark Tutorial: Validating Data in a Spark DataFrame - Part One

There's more than one way to skin a cat...four easy methods to validate data in a Spark DataFrame.

By Bipin Patwardhan · Sep. 02, 19 · Tutorial


Image: How your DataFrame looks after this tutorial

Recently, while developing a modular, metadata-based ingestion engine using Spark, we got into a discussion about data validation. Validation is the natural next step after ingestion, which is how we arrived at the topic.

You might be wondering, "What is so special about data validation? Is it because of Spark?" The reason for this article is partly Spark, but more importantly the fact that this simple task demonstrates the power of Spark and illustrates the principle that there is more than one way to achieve a goal.

You may also like: Apache Spark an Engine for Large-Scale Data Processing

The task at hand was pretty simple — we wanted to create a flexible and reusable library of classes that would make the task of data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I am using the null check (are the contents of a column null?). To keep things simple, I will assume that the data to be validated has been loaded into a Spark DataFrame named "df."
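All of the snippets below assume a running SparkSession, the functions import, and a DataFrame named "df". The following minimal setup is enough to try them out; the dataset, column names, and values are placeholders of my own, not part of the original examples.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Local session for experimentation; in a real job the session is usually provided.
val spark = SparkSession.builder()
  .appName("DataValidationExamples")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

// Illustrative data: one valid name, one null, one empty string.
val df: DataFrame = Seq(
  (1, "Alice"),
  (2, null),
  (3, "")
).toDF("id", "name")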

Method One: Filtering

One of the simplest methods of performing validation is to use a filter to isolate the invalid records: val newDF = df.filter(col("name").isNull). This keeps only the rows whose name column is null.

A variant of this technique is:

val newDF = df.filter(col("name").isNull).withColumn("nameIsNull", lit(true))


This variant is overkill — primarily because all the records in newDF are records where the name column is null. Hence, adding a new column with a "true" value is totally unnecessary, as every row will have 'true' in this column.
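For completeness, the complementary filter gives the valid records, and a simple count of the invalid ones is often all a validation report needs. A quick sketch, assuming the df defined above:

// Rows that pass the null check (the complement of the filter above).
val validDF = df.filter(col("name").isNotNull)

// Number of rows that fail the check; count() triggers a job, so use it sparingly on large data.
val invalidCount = df.filter(col("name").isNull).count()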

Method Two: When/Otherwise

The second technique is to use the "when" and "otherwise" constructs.

val newDF = df.withColumn("nameIsNull",
  when(col("name").isNull || col("name") === "", true).otherwise(false))

This method adds a new column that indicates the result of the null (or empty string) check on the name column. Unlike the filter approach, no rows are dropped; the new column contains both "true" and "false" values, depending on the contents of the name column.
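Because the goal is a reusable validation library, the when/otherwise pattern also extends to several columns without much ceremony. A minimal sketch, folding the check over a list of column names (here only "name"; the list itself is my assumption, not part of the original example):

// Add a <column>IsNull flag for every listed column, without dropping any rows.
val columnsToCheck = Seq("name") // extend with further column names as needed

val flaggedDF = columnsToCheck.foldLeft(df) { (acc, colName) =>
  acc.withColumn(s"${colName}IsNull",
    when(col(colName).isNull || col(colName) === "", true).otherwise(false))
}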

Method Three: Using Expr

Another technique is to use the "expr" feature.

val newDF = df.withColumn("nameIsNull", expr("case when name is null then true else false end"))
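Since "name is null" already evaluates to a boolean, the case expression can be shortened, and the same check can also be written with selectExpr. A small sketch under the same assumptions:

// Equivalent, shorter forms of the expr-based check.
val viaExpr = df.withColumn("nameIsNull", expr("name is null"))
val viaSelectExpr = df.selectExpr("*", "name is null as nameIsNull")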

Method Four: Overkill

Now, look at this technique.

var dfFinal: DataFrame = null
val dfTrue = df.filter(col("name").isNull).withColumn("nameIsNull", lit(true))
val dfFalse = df.filter(!col("name").isNull).withColumn("nameIsNull", lit(false))
if (dfTrue != null) {
  dfFinal = dfTrue.union(dfFalse)
} else {
  dfFinal = dfFalse
}


While valid, this technique is clearly overkill. Not only is it more elaborate than the previous methods, it also does double the work: it scans the DataFrame twice, once to evaluate the "true" condition and once more to evaluate the "false" condition.
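The double scan is visible in the physical plan, where the union has two branches that each read the source again. A quick way to see this for yourself, assuming the dfFinal built above:

// Prints the physical plan; expect a Union whose two branches each scan df separately.
dfFinal.explain()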

In this article, I have covered a few techniques that can be used to achieve the simple task of checking whether a Spark DataFrame column contains null values. The techniques not only illustrate the flexibility of Spark, but also prove the point that we can reach the same end goal in multiple ways.

Obviously, there are some tradeoffs, but sometimes we may want to choose a method that is simple to understand and maintain, rather than using a technique just because the API provides it. In the next article, I will cover how something similar can be achieved using UDFs.


Related Articles

  • Understanding Spark SQL, DataFrames, and Datasets.
