DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • When Coalesce Is Slower Than Repartition: A Spark Performance Paradox
  • Designing Scalable Ingestion and Access Layers for Policy and Enforcement Data
  • Offline Data Pipeline Best Practices Part 2:Optimizing Airflow Job Parameters for Apache Hive
  • 6 Best Practices to Build Data Pipelines

Trending

  • Context Is the New Schema
  • Why AI Forces a Rethink of Everything We Know About Software Security
  • Java Backend Development in the Era of Kubernetes and Docker
  • Swift Concurrency Part 4: Actors, Executors, and Reentrancy
  1. DZone
  2. Data Engineering
  3. Big Data
  4. Efficient Sampling Approach for Large Datasets

Efficient Sampling Approach for Large Datasets

In this article, we will learn about the central limit theorem and how it helps with random sampling in big-data-related problems.

By 
Rajesh Vakkalagadda user avatar
Rajesh Vakkalagadda
·
Jan. 22, 26 · Analysis
Likes (0)
Comment
Save
Tweet
Share
1.0K Views

Join the DZone community and get the full member experience.

Join For Free

Sampling is a fundamental process in machine learning that involves selecting a subset of data from a larger dataset. This technique is used to make training and evaluation more efficient, especially when working with massive datasets where processing every data point is impractical

However, sampling comes with its own challenges. Ensuring that samples are representative is crucial to prevent biases that could lead to poor model generalization and inaccurate evaluation results. The sample size must strike a balance between performance and resource constraints. Additionally, sampling strategies need to account for factors such as class imbalance, temporal dependencies, and other dataset-specific characteristics to maintain data integrity.

The most common languages used for sampling are Scala and PySpark, both maintained by the Apache Foundation. And a common challenge in these languages is that when we do sampling, the entire dataset is stored in one machine, thereby leading to memory errors.

In this article, I will explain how random sampling can be achieved at scale using Scala Spark and how the central limit theorem can be extended to solve this problem.

Problem Statement

One of the most common ways to obtain a sample dataset in Spark is by using the df.sample method on a DataFrame. In Spark, a DataFrame is a distributed object that stores your dataset’s rows and enables large-scale data processing. It leverages the MapReduce paradigm to perform operations efficiently across a cluster (we’ll cover MapReduce in more detail in future articles).

While the sample method is convenient, it can become problematic as your data grows. Internally, (Older versions) Spark collects the relevant data onto a single machine in the cluster before applying the random sampling logic. If your grouping or sampling operation results in a large dataset, this single machine may run out of memory (OOM error).

To address this, you might consider using larger machines or increasing the available memory. However, both approaches have drawbacks:

  • Larger instances increase your cloud provider costs.

When you use larger instances just to support sampling or a large job with skewed data, your entire cluster can be idle, which can lead to increased costs. Ideal usage should be around 80-90% of the cluster for CPU and memory. I will share more details, specifically on emitting metrics from the Apache Spark cluster and how we can use it to enable alarms for various scenarios.

  • More memory can slow down your cluster, especially for jobs that don’t require such large resources.

If we use more memory just to solve the runtime issue, some jobs in the partition might not need that much memory, and they will be underutilized even though the Spark UI shows full memory usage. You will observe that the number of CPUs used decreases in this use case, but the memory is still allocated. For an efficient job, you need your jobs to be able to work on smaller chunks of data.

It’s important to note that Spark allocates memory for the entire MapReduce job up front, so optimizing your sampling strategy is crucial for both performance and cost efficiency. While the above two solutions work, they are sub-optimal. To mitigate this, we can use the central limit theorem.

Solution

To solve this problem, we can use the central limit theorem, which basically says that if we pick a number between 0 and 1 and repeat this process n times, then the numbers are normally distributed. This can be applied to our sampling usecase here.

If we want to sample, say, 10% of the rows, then for each row we randomly pick a number between 0 and 1, filter the rows that are greater than 0.1, and then sample the dataset to 10% randomly.

Here is a sample code to achieve this sampling.

Scala
 
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Encoders, Row}



object DistributedSampler {

  def distributedSample1(df: DataFrame, fraction: Double): DataFrame = {
    df.withColumn("rand_val", rand())       // add random number 0–1
      .filter(col("rand_val") < fraction)   // keep rows that fall under the threshold
  }


  // sample each partition independently
  def distributedSample(df: DataFrame, fraction: Double): DataFrame = {
    df.mapPartitions { iter =>
      val rnd = new scala.util.Random()
      iter.filter(_ => rnd.nextDouble() < fraction)
    }(Encoders.kryo[Row])   // or Encoders.row
  }
}


Test code to verify the results with Spark sampling vs. the new approach:

Scala
 
import org.apache.spark.sql.SparkSession

object SamplingTest {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("SamplingTest")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = spark.range(0, 1000000000).toDF("id")

    // Spark built-in
    val t0 = System.currentTimeMillis()
    val s1 = df.sample(false, 0.1)
    println(s1.count())
    val t1 = System.currentTimeMillis()

    println(s"Default Spark sample time: ${t1 - t0} ms")

    // Distributed-safe
    val t2 = System.currentTimeMillis()
    val s2 = DistributedSampler.distributedSample1(df, 0.1)
    println(s2.count())
    val t3 = System.currentTimeMillis()

    println(s"Distributed sample time:   ${t3 - t2} ms")

    spark.stop()
  }
}


Results

Number of rows

Spark default sampling time (ms)

New approach sampling time (ms)

10000

1071

186

100000

1183

224

1000000

1215

239

10000000

1295

243

100000000

1310

321


From the results, we can see that the new approach performed better than Spark's default sampling. I am running this on a single-node machine, so the results might slightly vary, but the sample code shows that the approach works and is efficient.

Limitations

This solution only works if we want to sample the entire dataset randomly; this will not work. In some cases, I have seen records going beyond the sample size by a few records because of the nature of randomness; it's not a big number, but something to be aware of. For small datasets, randomness is high; for example, sampling 1000/10000 records can be done directly by Spark sampling, but if the number of rows exceeds 100k, the CLT approach also works efficiently.

In the next article, we will discuss emitting metrics from Spark and how it helps in analyzing your jobs.

Apache Spark Data (computing)

Opinions expressed by DZone contributors are their own.

Related

  • When Coalesce Is Slower Than Repartition: A Spark Performance Paradox
  • Designing Scalable Ingestion and Access Layers for Policy and Enforcement Data
  • Offline Data Pipeline Best Practices Part 2:Optimizing Airflow Job Parameters for Apache Hive
  • 6 Best Practices to Build Data Pipelines

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook