In statistics, the Kolmogorov-Smirnov test is used to check whether a sample follows a given distribution, or whether two samples follow the same distribution. In this article, I will show how to perform a statistical hypothesis test using the Spark machine learning API (aka **Spark MLlib**). The **Kolmogorov-Smirnov (KS)** test will be used to demonstrate hypothesis testing with a step-by-step example on the well-known **Haberman's survival dataset**.

However, before going deeper, some background knowledge is needed for those who are new to the area of statistical machine learning. Furthermore, technical details on hypothesis testing using **Pearson's chi-squared test** can be found in my previous article here.

**What Is Hypothesis Testing?**

A **statistical hypothesis** [1] is a hypothesis that can be tested based on observations modeled via a set of random variables. A **statistical hypothesis test** is, therefore, a method of statistical inference in which datasets are compared. Commonly, two sampled datasets are compared with each other, or a dataset obtained by sampling is compared against a synthetic dataset from an idealized model [1].

In other words, a hypothesis is proposed for the statistical relationship between two datasets. More technically, it is compared as an alternative to an idealized null hypothesis that proposes no relationship between the two datasets [1]. The comparison is deemed **statistically significant** if the relationship between the datasets would be an unlikely realization of the null hypothesis according to a threshold probability. The threshold probability is also called the **significance level** [1, 2].

The process of distinguishing between the **null hypothesis** and the **alternative hypothesis** [3] is aided by identifying two conceptual types of errors, called type 1 and type 2 errors. Additionally, a parametric limit is specified on how many type 1 errors will be permitted [1, 3]. An alternative framework is to specify one statistical model per candidate hypothesis and choose among them with model selection techniques, the most common of which are based on either the **Akaike information criterion** or the **Bayes factor** [1]. Hypothesis testing is thus a useful as well as powerful tool in statistics to determine:

i) Whether a result is **statistically significant**

ii) Whether a result **occurred by chance or not**

Because of its confirmatory nature, a statistical hypothesis test is also called **confirmatory data analysis**. Correspondingly, it can be contrasted with **exploratory data analysis**, which might not have pre-specified hypotheses [2, 3].

### Hypothesis Testing and the Confirmatory Data Analysis

In statistics, **exploratory data analysis (EDA)**, **initial data analysis (IDA)** and **confirmatory data analysis (CDA)** are three important aspects of arriving at a statistical hypothesis and the data analysis that supports it. EDA was first promoted by **John Tukey** [4] to encourage statisticians to explore the data and possibly formulate hypotheses that could lead to new data collection and experiments [1]. **EDA** is thus an approach for analyzing data to summarize its main characteristics, usually by visual methods like graphs, plots or interactive visualizations [5].

However, EDA is different from IDA. IDA focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, and on handling missing values and transforming variables as needed, for example using transformers like TF-IDF, VectorAssembler or indexers [1]. This is where hypothesis testing becomes essential in statistics: it determines whether a result is **statistically significant**, or in other words, whether a particular result is likely to have occurred by chance.

Note that EDA needs to be performed before CDA. The reason is natural: to perform a confirmatory test, you need to specify related statistics such as the mean, median, mode and standard deviation. In a nutshell, statistical hypothesis testing is sometimes called CDA.

## How Is It Performed?

According to Wikipedia [1], the following two lines of reasoning are considered for performing a hypothesis test:

### Initial Hypothesis-based Approach

There is an **initial research hypothesis** for which the truth is always *unknown*.

**Now the below steps are typically iterated sequentially:**

1. The first step is to state the relevant *null* and *alternative hypotheses*
2. The second step is to consider the *statistical assumptions* (e.g., **statistical independence**) being made about the sample in doing the test
3. Decide which test is appropriate, and state the relevant **test statistic**, say **T**
4. From the assumptions, derive the distribution (in most cases, the *normal distribution*) of the test statistic under the *null hypothesis*
5. Select a *significance level*, say **α**, which is the probability threshold below which the *null hypothesis* will be *rejected* (common values are 5% and 1%)
6. Compute the *observed value* `t_obs` from the observations of the test statistic **T** in step 3
7. Now, based on the value, decide whether to *reject* the *null hypothesis* in favor of the *alternate hypothesis* or not to reject it

Note that the assumption in step 2 is **non-trivial**, since *invalid assumptions* will make the results of the test **invalid**.

### Alternate Hypothesis-based Approach

Here, steps 1 to 3 are typically carried out as in the **initial hypothesis-based approach** stated above. After that, the following steps are performed:

- Compute the observed value `t_obs` from the observations of the test statistic `T` chosen in step 3
- Calculate the *p-value* [3], which is the *probability under the null hypothesis* of sampling a test statistic at least as extreme as the one that was observed
- Reject the **null hypothesis**, in favor of the **alternative hypothesis**, if and only if the *p-value* is less than the significance level (the *selected probability*) threshold

Now, to interpret the **p-value**, here I describe two *rules of thumb* based on several sources [1-3, 5]:

- If the p-value is **p > 0.05** (i.e., 5%), do not reject the null hypothesis: the deviation from the expected result is small enough to be attributed to chance. A p-value of 0.6, for example, means that, if the null hypothesis were true, a deviation at least as large as the observed one would occur about 60% of the time, which is well within the range of acceptable deviation.
- If the p-value is **p < 0.05**, reject the null hypothesis and conclude that some factors other than chance are responsible for the deviation.

Similarly, a **p-value** of 0.01 means that, if the null hypothesis were true, a deviation this large would occur only 1% of the time, which suggests that other factors are involved and need to be addressed [5]. Note that the *p-value* may vary for your case depending on the data quality, dimension, structure and types [2].

In a nutshell, if the **p-value < 0.05** (significance level), we reject the **null hypothesis** that the samples are drawn from the same distribution. In other words, p < 0.05 implies that **x** (independent variable) and **y** (dependent variable) come from different distributions.
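The decision rule above can be sketched in a few lines of plain Java. This is only an illustration: the class name `PValueDecision`, the method `rejectNull` and the 0.05 level used in `main` are assumptions made for this sketch and are not part of Spark.

```java
public class PValueDecision {

    /**
     * Returns true when the null hypothesis is rejected at significance level alpha,
     * i.e., when the p-value is strictly smaller than alpha.
     */
    public static boolean rejectNull(double pValue, double alpha) {
        if (pValue < 0.0 || pValue > 1.0 || alpha <= 0.0 || alpha >= 1.0) {
            throw new IllegalArgumentException("pValue must be in [0, 1] and alpha in (0, 1)");
        }
        return pValue < alpha;
    }

    public static void main(String[] args) {
        double alpha = 0.05; // assumed significance level (5%)
        System.out.println(rejectNull(0.60, alpha)); // false: do not reject the null hypothesis
        System.out.println(rejectNull(0.01, alpha)); // true: reject the null hypothesis
    }
}
```

Keeping the threshold as a parameter makes it easy to switch to a stricter level such as 1% when the use case calls for it.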

**Hypothesis Testing and Spark**

The current implementation of Spark (i.e., Spark 2.0.2) provides hypothesis testing for static as well as dynamic (i.e., streaming) data. First, the **Spark MLlib** API supports **Pearson's chi-squared (χ²) tests** for goodness of fit and independence on batch (static) datasets. Second, `Spark MLlib` provides a 1-sample, 2-sided implementation of the **KS** test for equality of probability distributions, which is discussed in the next section.

The third type of hypothesis testing that Spark supports works on streaming data. `Spark MLlib` provides online implementations of some tests to support use cases like **A/B testing** [8]. These tests may be performed on a Spark **discretized stream** of Boolean and double pairs, i.e., **DStream**[(Boolean, Double)] [5], where the first element of each tuple indicates the control group (false) or the treatment group (true) and the second element is the value of an observation.

**Kolmogorov-Smirnov Testing With Spark**

By providing the name of a theoretical distribution (currently only the **normal distribution** is supported) and its parameters, or a function that calculates the **cumulative distribution** of a given **theoretical distribution**, the user can test the **null hypothesis** that their sample is drawn from that distribution [5].
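To make the idea concrete, the one-sample KS statistic is the largest gap between the empirical cumulative distribution of the sample and the theoretical cumulative distribution. The sketch below computes it in plain Java; it is not Spark's implementation, and the class name `KsStatisticSketch`, the method `ksStatistic` and the uniform distribution on [0, 1] used in `main` are assumptions chosen only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class KsStatisticSketch {

    /**
     * One-sample KS statistic: the largest absolute difference between the
     * empirical CDF of the sample and the supplied theoretical CDF.
     */
    public static double ksStatistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] sorted = sample.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            double above = (i + 1.0) / n - f;   // empirical CDF just after the i-th point
            double below = f - (double) i / n;  // empirical CDF just before the i-th point
            d = Math.max(d, Math.max(above, below));
        }
        return d;
    }

    public static void main(String[] args) {
        // Assumed theoretical distribution for illustration: uniform on [0, 1]
        DoubleUnaryOperator uniformCdf = x -> Math.min(1.0, Math.max(0.0, x));
        double[] sample = {0.1, 0.4, 0.7}; // hypothetical observations
        System.out.println(ksStatistic(sample, uniformCdf)); // approximately 0.3
    }
}
```

A larger statistic means the sample deviates more from the theoretical distribution, which in turn yields a smaller p-value.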

If the user tests against the normal distribution (i.e., `distName="norm"`) but does not provide distribution parameters, the test initializes to the standard normal distribution and logs an appropriate message. Therefore, to test against any other normal distribution, we need to specify the **mean** and **standard deviation** as well. Here is how the **p-value** is explained in the Spark implementation (in Java-like pseudocode notation):

```
// pValue and nullHypothesis are assumed to be defined by the test result
String pValueExplain = null;
if (pValue <= 0.01) {
    pValueExplain = "Very strong presumption against null hypothesis: " + nullHypothesis;
} else if (0.01 < pValue && pValue <= 0.05) {
    pValueExplain = "Strong presumption against null hypothesis: " + nullHypothesis;
} else if (0.05 < pValue && pValue <= 0.1) {
    pValueExplain = "Low presumption against null hypothesis: " + nullHypothesis;
} else {
    pValueExplain = "No presumption against null hypothesis: " + nullHypothesis;
}
```

Here, the **pValueExplain** variable summarizes how strongly the *p-value* speaks against the null hypothesis; its value depends on the range the p-value falls into.

Next, I will show step by step how to perform hypothesis testing using the **Kolmogorov-Smirnov** test on the **Haberman survival dataset** [6]. Before moving on to the Spark example, let's explore the dataset first.

### Dataset Exploration

Haberman's Survival Dataset (HSD) can be downloaded from the **UCI machine learning dataset repository** at [6]. As per the dataset description at [7], the **HSD** contains cases from a study conducted between 1958 and 1970 at the **University of Chicago's Billings Hospital** on the survival of patients who had undergone surgery for **breast cancer**. The number of instances is 306 and the number of attributes is 4 (including the class attribute). Here is the related information regarding the 4 attributes:

1. Age of patient at time of operation (numerical)

2. Patient's year of operation (numerical)

3. Number of positive axillary nodes detected (numerical)

4. Survival status (class attribute)

1 = the patient survived 5 years or longer

2 = the patient died within 5 years

Figure 1 shows the top-level view of the Haberman's survival dataset.

### Spark MLlib Solution for the **Kolmogorov-Smirnov Test**

The Spark-based solution demonstrated here has 6 steps, covering package loading, data loading and parsing, test set preparation, testing and, finally, printing the results:

**Step-1: Load required packages and APIs**

```
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.stat.Statistics;
import org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult;
```

**Step-2: Create Spark context**

```
SparkConf conf = new SparkConf().setAppName("JavaHypothesisTestingKolmogorovSmirnovTestExample").setMaster("local");
JavaSparkContext jsc = new JavaSparkContext(conf);
```

**Step-3: Load and parse the dataset to create an ArrayList of Double**

```
String path = "input/haberman.data"; // Path of the data source
List<Double> myList = new ArrayList<Double>(); // ArrayList of Double
// try-with-resources closes the reader even if parsing fails
try (BufferedReader br = new BufferedReader(new FileReader(path))) {
    String line;
    while ((line = br.readLine()) != null) { // Read the file line by line
        String[] tokens = line.split(",");
        myList.add(Double.parseDouble(tokens[0])); // 1st value: age
        myList.add(Double.parseDouble(tokens[1])); // 2nd value: year of operation
        myList.add(Double.parseDouble(tokens[2])); // 3rd value: positive axillary nodes
        myList.add(Double.parseDouble(tokens[3])); // 4th value: survival status
    }
}
```
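As a variation, and not the approach used in this article, one could test a single attribute at a time instead of pooling all four columns into one list. The sketch below extracts one column from comma-separated lines in plain Java; the class name `ColumnParser`, the method `parseColumn` and the rows in `main` are hypothetical and serve only to illustrate the format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnParser {

    /** Extracts one zero-based column from comma-separated lines, skipping blank lines. */
    public static List<Double> parseColumn(List<String> lines, int columnIndex) {
        List<Double> values = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // ignore empty lines, e.g., a trailing newline in the file
            }
            String[] tokens = line.split(",");
            values.add(Double.parseDouble(tokens[columnIndex].trim()));
        }
        return values;
    }

    public static void main(String[] args) {
        // Illustrative rows in the same layout: age, year, nodes, survival status
        List<String> rows = Arrays.asList("40,60,2,1", "", "50,65,0,2");
        System.out.println(parseColumn(rows, 0)); // prints [40.0, 50.0]
    }
}
```

The resulting list can be turned into a JavaDoubleRDD in the same way as the pooled list in the next step.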

**Step-4: Prepare the test data**

Note that the **KS test** implementation takes a JavaDoubleRDD as input and tests it against a theoretical distribution, here the **normal distribution**. A JavaDoubleRDD can be generated by parallelizing the list `myList` built above. Therefore, let's first convert that list into an array of Double values as follows:

`Double [] list = myList.toArray(new Double[myList.size()]); `

Now that we have the array of doubles, we can prepare the JavaDoubleRDD as follows:

`JavaDoubleRDD data = jsc.parallelizeDoubles(Arrays.asList(list)); `

**Step-5: Perform the Kolmogorov-Smirnov Test**

Perform the **hypothesis** test using the Kolmogorov-Smirnov algorithm on the test dataset we prepared in the previous step. We need to explicitly specify the distribution as normal, and we also need to specify the mean and the standard deviation of the distribution to test against.

`KolmogorovSmirnovTestResult testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 35.0, 1.5);`

Note that here the theoretical distribution the data is tested against is normal (`"norm"`). The 3rd and 4th parameters are its mean and standard deviation (SD), respectively.

The need for these parameters shows why an exploratory analysis has to be performed before the confirmatory data analysis. For brevity and simplicity, we assume a mean of 35.0 and an SD of 1.5. Readers should calculate the arithmetic mean and the standard deviation of the data to get a more accurate result.
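One way to obtain these two parameters is sketched below in plain Java. The class name `SampleStats` and its methods are assumptions made for this illustration, and the input values are hypothetical. The sketch computes the population standard deviation (dividing by n); dividing by n - 1 instead would give the sample standard deviation. On the Spark side, JavaDoubleRDD also offers `mean()` and `stdev()` methods that could be used for the same purpose.

```java
public class SampleStats {

    /** Arithmetic mean of the values. */
    public static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    public static double stdDev(double[] values) {
        double m = mean(values);
        double squaredDiffs = 0.0;
        for (double v : values) {
            squaredDiffs += (v - m) * (v - m);
        }
        return Math.sqrt(squaredDiffs / values.length);
    }

    public static void main(String[] args) {
        double[] values = {2, 4, 4, 4, 5, 5, 7, 9}; // hypothetical observations
        System.out.println(mean(values));   // prints 5.0
        System.out.println(stdDev(values)); // prints 2.0
    }
}
```

The two results would then replace the assumed 35.0 and 1.5 in the call to the KS test.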

**Step-6: Print the test results**

The summary of the test includes the p-value, the test statistic and the null hypothesis. If the p-value indicates significance, we can reject the null hypothesis. The summary can be printed as follows:

`System.out.println(testResult);`

For the above test configuration, you should receive results like the following:

Kolmogorov-Smirnov test summary:

degrees of freedom = 0

statistic = 0.49957093966680316

pValue = 2.74724687443495E-11

The Very strong presumption against null hypothesis: Sample follows the theoretical distribution.

## Conclusion

From the above result, we got a very small p-value (about 2.7E-11), far below the 0.05 significance level. Therefore, the null hypothesis that the sample follows a normal distribution with mean 35.0 and SD 1.5 is rejected, as the "very strong presumption against null hypothesis" message in the summary indicates. However, for an in-depth analysis of the p-value, readers are encouraged to read the documentation at [1-3]. Moreover, selecting a proper dataset and performing the hypothesis test before applying hyperparameter tuning are recommended too. Readers can also find more on statistical learning with Spark MLlib at [5].

The Maven-friendly **pom.xml** file, the associated source code and **Haberman's survival dataset** can be downloaded from my **GitHub** repository at [8]. In my next article, I will show how to perform A/B-like testing on real-time streaming data.
