What Are Spark Checkpoints on Data Frames?

Checkpoints freeze the content of your data frames before you do something else. They're essential to keeping track of your data frames.

By Jean-Georges Perrin · Feb. 09, 17 · Opinion

Let’s look at what checkpoints can do for your Spark data frames and walk through a Java example of how to use them.

Checkpoint on Data Frames

In v2.1.0, Apache Spark introduced checkpoints on data frames and datasets. I will continue to use the term "data frame" for a Dataset<Row>. The Javadoc describes it as:

Returns a checkpointed version of this dataset. Checkpointing can be used to truncate the logical plan of this dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

However, I think it requires a little more explanation.

Why Would I Use a Checkpoint?

Basically, I use a checkpoint when I want to freeze the content of a data frame before doing something else. That can be in iterative algorithms (as the Javadoc mentions), in recursive algorithms, or simply when branching a data frame to run different kinds of analytics on each branch.
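To make the branching case concrete, here is a minimal, hypothetical sketch (the class name, column names, and the /tmp checkpoint directory are my own, not from the article's cookbook): the shared upstream work is checkpointed once, and both analytics then start from the frozen frame instead of recomputing the plan.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of branching a data frame after a checkpoint.
public class BranchingSketch {
    // Returns {rows in branch A, rows in branch B} so the behavior is checkable.
    public static long[] run() {
        SparkSession spark = SparkSession.builder()
                .appName("BranchingSketch").master("local[*]").getOrCreate();
        spark.sparkContext().setCheckpointDir("/tmp/branching-ckpt");

        // Some upstream work we do not want to recompute for each branch.
        Dataset<Row> base = spark.range(1, 11).toDF("n").filter("n % 2 = 0");

        // Freeze it once; both analytics start from the checkpointed files.
        Dataset<Row> frozen = base.checkpoint();

        long branchA = frozen.filter("n > 4").count(); // 6, 8, 10
        long branchB = frozen.count();                 // 2, 4, 6, 8, 10
        spark.stop();
        return new long[] { branchA, branchB };
    }

    public static void main(String[] args) {
        long[] counts = run();
        System.out.println("branch A: " + counts[0] + ", branch B: " + counts[1]);
    }
}
```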

Spark has offered checkpoints on streaming since early versions (at least v1.2.0), but checkpoints on data frames are a different beast.

Types of Checkpoints

You can create two kinds of checkpoints.

Eager Checkpoint

An eager checkpoint will cut the lineage from previous data frames and let you start “fresh” from that point on. Concretely, Spark writes your data frame to files in the directory specified by setCheckpointDir() and builds a brand-new data frame from them. The call blocks until the write completes.
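A minimal sketch of the eager form (the class name and /tmp path are illustrative, not from the article): checkpoint() with no argument is eager by default, so the data is written before the call returns, and the explain() output of the checkpointed frame shows the truncated plan.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;

// Hypothetical sketch of an eager checkpoint cutting the lineage.
public class EagerCheckpointSketch {
    public static long run() {
        SparkSession spark = SparkSession.builder()
                .appName("EagerCheckpointSketch").master("local[*]").getOrCreate();
        spark.sparkContext().setCheckpointDir("/tmp/eager-ckpt");

        Dataset<Row> df = spark.range(1, 7).toDF("n");
        Dataset<Row> grown = df.filter("n > 1")
                .withColumn("twice", functions.col("n").multiply(2));

        // Eager by default: data is written now, and the call blocks until done.
        Dataset<Row> frozen = grown.checkpoint();

        grown.explain();  // full lineage: Range -> Filter -> Project
        frozen.explain(); // truncated: a scan over the checkpointed rows
        long rows = frozen.count();
        spark.stop();
        return rows;
    }

    public static void main(String[] args) {
        System.out.println(run() + " rows after the eager checkpoint");
    }
}
```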

Non-Eager Checkpoint

A non-eager checkpoint, on the other hand, is lazy: Spark only materializes the checkpoint when an action is first run on the returned data frame, so the lineage from previous operations is kept until then.
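One way to see the laziness is to watch the checkpoint directory on disk. This hypothetical sketch (class name and paths are my own) counts the files under the directory before and after the first action; with checkpoint(false), nothing is written until that action runs.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.stream.Stream;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch: a non-eager checkpoint is only written on the first action.
public class NonEagerCheckpointSketch {
    static long regularFilesUnder(File dir) throws IOException {
        try (Stream<java.nio.file.Path> walk = Files.walk(dir.toPath())) {
            return walk.filter(Files::isRegularFile).count();
        }
    }

    // Returns {files before the first action, files after it}.
    public static long[] run() throws IOException {
        SparkSession spark = SparkSession.builder()
                .appName("NonEagerCheckpointSketch").master("local[*]").getOrCreate();
        File ckptDir = new File("/tmp/non-eager-ckpt-" + System.nanoTime());
        spark.sparkContext().setCheckpointDir(ckptDir.getAbsolutePath());

        Dataset<Row> lazyCkpt = spark.range(1, 7).toDF("n").checkpoint(false);
        long before = regularFilesUnder(ckptDir); // nothing written yet

        lazyCkpt.count(); // the first action materializes the checkpoint
        long after = regularFilesUnder(ckptDir);  // checkpoint files now exist
        spark.stop();
        return new long[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        long[] files = run();
        System.out.println("before: " + files[0] + ", after: " + files[1]);
    }
}
```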

Implementing the Code

Now that we understand what a checkpoint is and how it works, let’s see how we implement that in Java. The code is part of my Apache Spark Java Cookbook on GitHub.

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataframeCheckpoint {
    public static void main(String[] args) {
        DataframeCheckpoint app = new DataframeCheckpoint();
        app.start();
    }

    private void start() {
        SparkConf conf = new SparkConf().setAppName("Checkpoint").setMaster("local[*]");
        SparkContext sparkContext = new SparkContext(conf);
        // We need to specify where Spark will save the checkpoint file. It can be an HDFS location.
        sparkContext.setCheckpointDir("/tmp");
        SparkSession spark = SparkSession.builder().appName("Checkpoint").master("local[*]").getOrCreate();

        String filename = "data/tuple-data-file.csv";
        Dataset<Row> df1 = spark.read().format("csv").option("inferSchema", "true").option("header", "false")
                .load(filename);
        System.out.println("DF #1 - step #1: simple dump of the dataframe");
        df1.show();

        System.out.println("DF #2 - step #1: same as DF #1 - step #1");
        Dataset<Row> df2 = df1.checkpoint(false);
        df2.show();

        df1 = df1.withColumn("x", df1.col("_c0"));
        System.out.println("DF #1 - step #2: new column x, which is the same as _c0");
        df1.show();

        System.out.println("DF #2 - step #2: no operation was done on df2");
        df2.show();
    }
}

The output is, without much surprise:

DF #1 - step #1: simple dump of the dataframe
+---+---+
|_c0|_c1|
+---+---+
|  1|  5|
|  2| 13|
|  3| 27|
|  4| 39|
|  5| 41|
|  6| 55|
+---+---+

DF #2 - step #1: same as DF #1 - step #1
+---+---+
|_c0|_c1|
+---+---+
|  1|  5|
|  2| 13|
|  3| 27|
|  4| 39|
|  5| 41|
|  6| 55|
+---+---+

DF #1 - step #2: new column x, which is the same as _c0
+---+---+---+
|_c0|_c1|  x|
+---+---+---+
|  1|  5|  1|
|  2| 13|  2|
|  3| 27|  3|
|  4| 39|  4|
|  5| 41|  5|
|  6| 55|  6|
+---+---+---+

DF #2 - step #2: no operation was done on df2
+---+---+
|_c0|_c1|
+---+---+
|  1|  5|
|  2| 13|
|  3| 27|
|  4| 39|
|  5| 41|
|  6| 55|
+---+---+

Although this example is really basic, it shows how to checkpoint a data frame and observe how the original keeps evolving independently afterward. Hopefully, this will be useful to you, too.

A comment is always appreciated! By the way, thanks to Burak Yavuz at Databricks for his additional explanations.


Opinions expressed by DZone contributors are their own.
