Efficient Sampling Approach for Large Datasets

In this article, we will learn about the central limit theorem and how it helps with random sampling in big-data-related problems.

Jan. 22, 26 · Analysis

Likes (0)

Comment

Save

1.2K Views

Sampling is a fundamental process in machine learning that involves selecting a subset of data from a larger dataset. This technique is used to make training and evaluation more efficient, especially when working with massive datasets where processing every data point is impractical

However, sampling comes with its own challenges. Ensuring that samples are representative is crucial to prevent biases that could lead to poor model generalization and inaccurate evaluation results. The sample size must strike a balance between performance and resource constraints. Additionally, sampling strategies need to account for factors such as class imbalance, temporal dependencies, and other dataset-specific characteristics to maintain data integrity.

The most common languages used for sampling are Scala and PySpark, both maintained by the Apache Foundation. And a common challenge in these languages is that when we do sampling, the entire dataset is stored in one machine, thereby leading to memory errors.

In this article, I will explain how random sampling can be achieved at scale using Scala Spark and how the central limit theorem can be extended to solve this problem.

Problem Statement

One of the most common ways to obtain a sample dataset in Spark is by using the df.sample method on a DataFrame. In Spark, a DataFrame is a distributed object that stores your dataset’s rows and enables large-scale data processing. It leverages the MapReduce paradigm to perform operations efficiently across a cluster (we’ll cover MapReduce in more detail in future articles).

While the sample method is convenient, it can become problematic as your data grows. Internally, (Older versions) Spark collects the relevant data onto a single machine in the cluster before applying the random sampling logic. If your grouping or sampling operation results in a large dataset, this single machine may run out of memory (OOM error).

To address this, you might consider using larger machines or increasing the available memory. However, both approaches have drawbacks:

Larger instances increase your cloud provider costs.

When you use larger instances just to support sampling or a large job with skewed data, your entire cluster can be idle, which can lead to increased costs. Ideal usage should be around 80-90% of the cluster for CPU and memory. I will share more details, specifically on emitting metrics from the Apache Spark cluster and how we can use it to enable alarms for various scenarios.

More memory can slow down your cluster, especially for jobs that don’t require such large resources.

If we use more memory just to solve the runtime issue, some jobs in the partition might not need that much memory, and they will be underutilized even though the Spark UI shows full memory usage. You will observe that the number of CPUs used decreases in this use case, but the memory is still allocated. For an efficient job, you need your jobs to be able to work on smaller chunks of data.

It’s important to note that Spark allocates memory for the entire MapReduce job up front, so optimizing your sampling strategy is crucial for both performance and cost efficiency. While the above two solutions work, they are sub-optimal. To mitigate this, we can use the central limit theorem.

Solution

To solve this problem, we can use the central limit theorem, which basically says that if we pick a number between 0 and 1 and repeat this process n times, then the numbers are normally distributed. This can be applied to our sampling usecase here.

If we want to sample, say, 10% of the rows, then for each row we randomly pick a number between 0 and 1, filter the rows that are greater than 0.1, and then sample the dataset to 10% randomly.

Here is a sample code to achieve this sampling.

    Scala
   
 

   import org.apache.spark.sql.functions._
import org.apache.spark.sql.{DataFrame, Encoders, Row}



object DistributedSampler {

  def distributedSample1(df: DataFrame, fraction: Double): DataFrame = {
    df.withColumn("rand_val", rand())       // add random number 0–1
      .filter(col("rand_val") < fraction)   // keep rows that fall under the threshold
  }


  // sample each partition independently
  def distributedSample(df: DataFrame, fraction: Double): DataFrame = {
    df.mapPartitions { iter =>
      val rnd = new scala.util.Random()
      iter.filter(_ => rnd.nextDouble() < fraction)
    }(Encoders.kryo[Row])   // or Encoders.row
  }
}

  

Test code to verify the results with Spark sampling vs. the new approach:

    Scala
   
 

   import org.apache.spark.sql.SparkSession

object SamplingTest {
  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .appName("SamplingTest")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = spark.range(0, 1000000000).toDF("id")

    // Spark built-in
    val t0 = System.currentTimeMillis()
    val s1 = df.sample(false, 0.1)
    println(s1.count())
    val t1 = System.currentTimeMillis()

    println(s"Default Spark sample time: ${t1 - t0} ms")

    // Distributed-safe
    val t2 = System.currentTimeMillis()
    val s2 = DistributedSampler.distributedSample1(df, 0.1)
    println(s2.count())
    val t3 = System.currentTimeMillis()

    println(s"Distributed sample time:   ${t3 - t2} ms")

    spark.stop()
  }
}

  

Results

Number of rows	Spark default sampling time (ms)	New approach sampling time (ms)
10000	1071	186
100000	1183	224
1000000	1215	239
10000000	1295	243
100000000	1310	321

From the results, we can see that the new approach performed better than Spark's default sampling. I am running this on a single-node machine, so the results might slightly vary, but the sample code shows that the approach works and is efficient.

Limitations

This solution only works if we want to sample the entire dataset randomly; this will not work. In some cases, I have seen records going beyond the sample size by a few records because of the nature of randomness; it's not a big number, but something to be aware of. For small datasets, randomness is high; for example, sampling 1000/10000 records can be done directly by Spark sampling, but if the number of rows exceeds 100k, the CLT approach also works efficiently.

In the next article, we will discuss emitting metrics from Spark and how it helps in analyzing your jobs.

Apache Spark Data (computing)

Opinions expressed by DZone contributors are their own.

Related

Trending