Apache Spark: Setting Default Values

Quick tip: Prefer using DataFrameNaFunctions to prevent code duplication when the same default values are set in several queries that use the same DataFrame.

By Igor Sorokin · Sep. 04, 17 · Tutorial

Using a default value instead of 'null' is a common practice, and because a Spark struct field can be nullable, it applies to DataFrames too. Let's say that we have a DataFrame of music track plays with the following schema:

root
 |-- id: string (nullable = false)
 |-- trackId: string (nullable = false)
 |-- trackLength: long (nullable = true)
 |-- albumId: string (nullable = true)
 |-- completionPercentage: double (nullable = true)
 |-- totalPlayedMillis: long (nullable = false)


And we want to find:

  • Plays that are less than 30 seconds

  • Plays that are more than 1 minute with a completion percentage greater than or equal to 80%

For both of the queries, we need to set default values:

  • "no album" if 'albumId' is 'null'

  • 0 if 'completionPercentage' is 'null'

These queries could be written as: 

// Uses static imports of coalesce, col, and lit from org.apache.spark.sql.functions.
Dataset<Row> dataFrame = session.createDataFrame(source, structType);
Dataset<Row> playsLessThan30seconds = getPlaysLessThan30seconds(dataFrame);
Dataset<Row> playGte1minuteWithCompletionGte80 = getPlayGte1minuteWithCompletionPercentageGte80(dataFrame);
...

private static Dataset<Row> getPlaysLessThan30seconds(Dataset<Row> dataFrame) {
  return dataFrame
      .select(
          col("id"),
          col("trackId"),
          coalesce(col("albumId"), lit("no album"))
              .as("albumId"),
          col("trackLength"),
          coalesce(col("completionPercentage"), lit(0.0D))
              .as("completionPercentage")
      )
      .where(col("totalPlayedMillis").lt(30000L));
}

private static Dataset<Row> getPlayGte1minuteWithCompletionPercentageGte80(Dataset<Row> dataFrame) {
  return dataFrame
      .select(
          col("id"),
          col("trackId"),
          coalesce(col("albumId"), lit("no album"))
              .as("albumId"),
          col("trackLength"),
          coalesce(col("completionPercentage"), lit(0.0D))
              .as("completionPercentage")
      )
      .where(col("totalPlayedMillis").gt(60000L)
             .and(col("completionPercentage").geq(0.8D)));
}


The result for a test dataset is:

+---------------------------------------------------------------------------------------------+
|                                 Plays less than 30 seconds                                  |
+----------+----------------+--------------------------------+-----------+--------------------+
|id        |trackId         |albumId                         |trackLength|completionPercentage|
+----------+----------------+--------------------------------+-----------+--------------------+
|oZzaToTxqy|I Keep Coming   |0                               |360000     |0.06388888888888888 |
|bejFaTssDw|Una palabra     |no album                        |180000     |0.10555555555555556 |
|jtydOYuSkJ|A 1000 Times    |I Had A Dream That You Were Mine|240000     |0.0                 |
+----------+----------------+--------------------------------+-----------+--------------------+

+--------------------------------------------------------------------------+  
|          Plays gte 1 minute with completion percentage gte 80%           |       
+----------+---------------------+--------+-----------+--------------------+
|id        |trackId              |albumId |trackLength|completionPercentage|
+----------+---------------------+--------+-----------+--------------------+
|nJXKxjobwo|Wrench and Numbers   |Fargo   |300000     |0.8                 |
|svyULOUcKC|Dutch National Anthem|no album|300000     |0.8666666666666667  |
+----------+---------------------+--------+-----------+--------------------+ 


Unfortunately, this approach invites copy-and-paste: if at some point a default value for 'trackLength' is also required, both of these methods have to change. Another disadvantage is that any new method that needs the same default values has to duplicate the code yet again.

A possible solution that helps to reduce the boilerplate is DataFrameNaFunctions, which is intended for handling missing data: replacing specific values, dropping rows that contain 'null' or 'NaN', and setting default values:

// ImmutableMap is Guava's com.google.common.collect.ImmutableMap.
Dataset<Row> dataFrame = session.createDataFrame(source, structType)
    .na()
    .fill(ImmutableMap.of(
        "albumId", "no album",
        "completionPercentage", 0.0D
    ));
Dataset<Row> playsLessThan30seconds = getPlaysLessThan30seconds(dataFrame);
Dataset<Row> playGte1minuteWithCompletionGte80 = getPlayGte1minuteWithCompletionPercentageGte80(dataFrame);
...

private static Dataset<Row> getPlaysLessThan30seconds(Dataset<Row> dataFrame) {
  return dataFrame
      .select("id", "trackId", "albumId", "trackLength", "completionPercentage")
      .where(col("totalPlayedMillis").lt(30000L));
}

private static Dataset<Row> getPlayGte1minuteWithCompletionPercentageGte80(Dataset<Row> dataFrame) {
  return dataFrame
      .select("id", "trackId", "albumId", "trackLength", "completionPercentage")
      .where(col("totalPlayedMillis").gt(60000L)
             .and(col("completionPercentage").geq(0.8D)));
}


The result is still the same:

+---------------------------------------------------------------------------------------------+
|                                 Plays less than 30 seconds                                  |
+----------+----------------+--------------------------------+-----------+--------------------+
|id        |trackId         |albumId                         |trackLength|completionPercentage|
+----------+----------------+--------------------------------+-----------+--------------------+
|oZzaToTxqy|I Keep Coming   |0                               |360000     |0.06388888888888888 |
|bejFaTssDw|Una palabra     |no album                        |180000     |0.10555555555555556 |
|jtydOYuSkJ|A 1000 Times    |I Had A Dream That You Were Mine|240000     |0.0                 |
+----------+----------------+--------------------------------+-----------+--------------------+

+--------------------------------------------------------------------------+  
|          Plays gte 1 minute with completion percentage gte 80%           |       
+----------+---------------------+--------+-----------+--------------------+
|id        |trackId              |albumId |trackLength|completionPercentage|
+----------+---------------------+--------+-----------+--------------------+
|nJXKxjobwo|Wrench and Numbers   |Fargo   |300000     |0.8                 |
|svyULOUcKC|Dutch National Anthem|no album|300000     |0.8666666666666667  |
+----------+---------------------+--------+-----------+--------------------+  
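
To make the maintenance benefit concrete, here is a minimal, self-contained sketch of the 'trackLength' scenario mentioned earlier: with the defaults kept in one map, covering another column is a single new entry and the query method stays untouched. The class name, the 'withDefaults' and 'samplePlays' helpers, the two sample rows, and the default of 0 for 'trackLength' are all assumptions made up for illustration. The sketch also uses a plain HashMap instead of Guava's ImmutableMap and filters before projecting, which is a stylistic choice rather than something the article requires.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class SharedDefaultsSketch {

  // All defaults live in one place. The trackLength entry is the only change
  // needed to cover a new column (0L is a hypothetical default).
  static Map<String, Object> defaults() {
    Map<String, Object> defaults = new HashMap<>();
    defaults.put("albumId", "no album");
    defaults.put("completionPercentage", 0.0D);
    defaults.put("trackLength", 0L);
    return defaults;
  }

  // Replaces nulls in the listed columns; other columns pass through unchanged.
  static Dataset<Row> withDefaults(Dataset<Row> plays) {
    return plays.na().fill(defaults());
  }

  // Same query as in the article, with the filter applied before the projection.
  static Dataset<Row> getPlaysLessThan30seconds(Dataset<Row> dataFrame) {
    return dataFrame
        .where(col("totalPlayedMillis").lt(30000L))
        .select("id", "trackId", "albumId", "trackLength", "completionPercentage");
  }

  // Two made-up rows that follow the article's schema; not the author's test data.
  static Dataset<Row> samplePlays(SparkSession session) {
    StructType structType = new StructType()
        .add("id", DataTypes.StringType, false)
        .add("trackId", DataTypes.StringType, false)
        .add("trackLength", DataTypes.LongType, true)
        .add("albumId", DataTypes.StringType, true)
        .add("completionPercentage", DataTypes.DoubleType, true)
        .add("totalPlayedMillis", DataTypes.LongType, false);

    List<Row> source = Arrays.asList(
        RowFactory.create("play-1", "track-1", null, null, null, 12000L),
        RowFactory.create("play-2", "track-2", 240000L, "album-2", 0.9D, 216000L));

    return session.createDataFrame(source, structType);
  }

  public static void main(String[] args) {
    SparkSession session = SparkSession.builder()
        .master("local[1]")
        .appName("shared-defaults-sketch")
        .getOrCreate();

    Dataset<Row> dataFrame = withDefaults(samplePlays(session));
    getPlaysLessThan30seconds(dataFrame).show(false);

    session.stop();
  }
}
```

With these sample rows, only 'play-1' is shorter than 30 seconds, and its three missing fields should come back as 'no album', 0, and 0.0 without any change to the query method itself.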

Conclusion

Prefer using DataFrameNaFunctions to prevent code duplication when the same default values are set in several queries that use the same DataFrame.
