Testing for Equal Variance: F-Test, Bartlett, Levene

Learn how to test whether two datasets come from distributions with the same variance using the F-test, Bartlett's test, and Levene's test.

John Cook

May. 18, 18 · Tutorial

The two-sample t-test is a way to test whether two datasets come from distributions with the same mean. I wrote a few days ago about how the test performs under ideal circumstances, as well as less-than-ideal circumstances.

This is an analogous post for testing whether two datasets come from distributions with the same variance. Statistics textbooks often present the F-test for this task, then warn in a footnote that the test is highly dependent on the assumption that both datasets come from normal distributions.

Sensitivity and Robustness

Statistics texts give too little attention to robustness in my opinion. Modeling assumptions never hold exactly, so it's important to know how procedures perform when the assumptions don't hold exactly. Since the F-test is one of the rare instances where textbooks warn about a lack of robustness, I expected the F-test to perform terribly under simulation, relative to its recommended alternatives Bartlett's test and Levene's test. That's not exactly what I found.

Simulation Design

For my simulations, I drew 35 samples from each of two distributions. I selected significance levels for the F-test, Bartlett's test, and Levene's test so that each would have roughly a 5% Type I error rate under a null scenario, in which both datasets come from the same distribution, and roughly a 20% Type II error rate under an alternative scenario.

I chose my initial null and alternative scenarios to use normal (Gaussian) distributions, i.e. to satisfy the assumptions of the F-test. Then, I used the same designs for data coming from a heavy-tailed distribution to see how well each of the tests performed.

For the normal null scenario, both datasets were drawn from a normal distribution with mean 0 and standard deviation 15. For the normal alternative scenario, I used normal distributions with standard deviations 15 and 25.
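The three tests can be applied to one pair of samples along the following lines. This is a minimal sketch assuming NumPy and SciPy, not the code behind the results below. SciPy provides `bartlett` and `levene` directly; the helper `f_test_pvalue` is a hypothetical name for the textbook two-sided F-test, and `levene` is left at its default of centering on the median. Since the post does not say how each p-value was computed, the numbers this produces need not match the tables.

```python
import numpy as np
from scipy import stats

def f_test_pvalue(x, y):
    """Two-sided p-value for the textbook F-test of equal variances (assumed form)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Ratio of unbiased sample variances.
    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfx, dfy = len(x) - 1, len(y) - 1
    lower = stats.f.cdf(f_stat, dfx, dfy)
    # Double the smaller tail probability, capped at 1.
    return min(1.0, 2.0 * min(lower, 1.0 - lower))

def variance_test_pvalues(x, y):
    """P-values from all three equal-variance tests for one pair of samples."""
    return {
        "F": f_test_pvalue(x, y),
        "Bartlett": stats.bartlett(x, y).pvalue,
        "Levene": stats.levene(x, y).pvalue,  # default center='median'
    }

# One draw from the normal alternative scenario: standard deviations 15 and 25.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 15.0, size=35)
y = rng.normal(0.0, 25.0, size=35)
pvals = variance_test_pvalues(x, y)
for name, p in pvals.items():
    print(f"{name:8s} p = {p:.4f}")
```

Each test rejects the hypothesis of equal variances when its p-value falls below the alpha level chosen for that test.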

Normal Distribution Calibration

Here are the results from the normal distribution simulations.

|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.0390 |  0.1863 |
| Bartlett |  0.04 | 0.0396 |  0.1906 |
| Levene   |  0.06 | 0.0439 |  0.2607 |
|----------+-------+--------+---------|

Here, the Type I column is the proportion of times the test incorrectly concluded that identical distributions had unequal variances. The Type II column reports the proportion of times the test failed to conclude that distributions with different variances indeed had unequal variances. Results were based on simulating 10,000 experiments.
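The two error columns can be estimated by counting rejections over repeated simulated experiments. The sketch below does this for Bartlett's test only, under the normal scenarios described above. The function names and the seed are illustrative, and it runs 2,000 experiments per scenario rather than 10,000 to keep the run short, so its estimates are noisier than those in the table.

```python
import numpy as np
from scipy import stats

def rejection_rate(pvalue_fn, draw_x, draw_y, alpha, n_sims, rng):
    """Fraction of simulated experiments in which the test rejects equal variances."""
    rejections = 0
    for _ in range(n_sims):
        if pvalue_fn(draw_x(rng), draw_y(rng)) < alpha:
            rejections += 1
    return rejections / n_sims

def bartlett_pvalue(x, y):
    return stats.bartlett(x, y).pvalue

N = 35  # samples per dataset, as in the simulation design

def draw_sd15(rng):
    return rng.normal(0.0, 15.0, size=N)

def draw_sd25(rng):
    return rng.normal(0.0, 25.0, size=N)

rng = np.random.default_rng(2018)  # arbitrary seed, for repeatability only

# Type I: rejecting when both datasets share the same distribution.
type_i = rejection_rate(bartlett_pvalue, draw_sd15, draw_sd15,
                        alpha=0.04, n_sims=2000, rng=rng)
# Type II: failing to reject when the standard deviations are 15 and 25.
type_ii = 1.0 - rejection_rate(bartlett_pvalue, draw_sd15, draw_sd25,
                               alpha=0.04, n_sims=2000, rng=rng)
print(f"Bartlett: Type I ~ {type_i:.4f}, Type II ~ {type_ii:.4f}")
```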

The three tests had roughly equal operating characteristics. The only difference that stands out above simulation noise is that the Levene test had a larger Type II error than the other tests when calibrated to have the same Type I error.
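To put a rough size on "simulation noise," the standard error of a proportion estimated from 10,000 independent experiments can be computed from the binomial formula sqrt(p(1 - p)/n). The sketch below applies it to the Type II rates in the table; the function name is illustrative.

```python
import math

def proportion_standard_error(p, n_sims):
    """Binomial standard error of a proportion estimated from n_sims independent trials."""
    return math.sqrt(p * (1.0 - p) / n_sims)

n_sims = 10_000
type_ii_rates = {"F": 0.1863, "Bartlett": 0.1906, "Levene": 0.2607}

for name, rate in type_ii_rates.items():
    se = proportion_standard_error(rate, n_sims)
    print(f"{name:8s} Type II = {rate:.4f}, standard error ~ {se:.4f}")
```

Each standard error is roughly 0.004. The gap between the F and Bartlett Type II rates is about 0.004, which is within noise, while the gap to Levene's rate is about 0.07, far outside it.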

To calibrate the operating characteristics, I used alpha levels 0.13, 0.04, and 0.06 respectively for the F, Bartlett, and Levene tests.

Heavy-Tail Simulation Results

Next, I used the design parameters above, i.e. the alpha levels for each test, but drew data from distributions with a heavier tail. For the null scenario, both datasets were drawn from a Student t distribution with 4 degrees of freedom and scale 15. For the alternative scenario, the scale of one of the distributions was increased to 25. Here are the results, again based on 10,000 simulations.
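Heavy-tailed data of this kind can be drawn by multiplying standard Student t variates by the scale. This is a minimal sketch assuming NumPy; `draw_scaled_t` is an illustrative name. Note that the scale is not the standard deviation: a t distribution with 4 degrees of freedom has variance 4/(4 - 2) = 2, so scale 15 corresponds to a standard deviation of about 21.2.

```python
import numpy as np

def draw_scaled_t(rng, scale, size, df=4):
    """Draw from a Student t distribution with df degrees of freedom, stretched by scale."""
    return scale * rng.standard_t(df, size=size)

rng = np.random.default_rng(4)  # arbitrary seed

# Null scenario: both datasets have scale 15.
null_x = draw_scaled_t(rng, 15.0, 35)
null_y = draw_scaled_t(rng, 15.0, 35)

# Alternative scenario: the second dataset has scale 25.
alt_y = draw_scaled_t(rng, 25.0, 35)

# Theoretical standard deviation for scale 15: scale * sqrt(df / (df - 2)).
df = 4
theoretical_sd = 15.0 * np.sqrt(df / (df - 2))
print(f"theoretical sd at scale 15: {theoretical_sd:.2f}")
```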

|----------+-------+--------+---------|
| Test     | Alpha | Type I | Type II |
|----------+-------+--------+---------|
| F        |  0.13 | 0.2417 |  0.2852 |
| Bartlett |  0.04 | 0.2165 |  0.2859 |
| Levene   |  0.06 | 0.0448 |  0.4537 |
|----------+-------+--------+---------|

The operating characteristics degraded when drawing samples from a heavy-tailed distribution, t with 4 degrees of freedom, but they didn't degrade uniformly.

Compared to the F-test, the Bartlett test had a slightly lower Type I error and essentially the same Type II error.

The Levene test had a much lower Type I error than the other tests, hardly higher than it was when drawing from a normal distribution, but had a higher Type II error.

Conclusion

The F-test is indeed sensitive to departures from the Gaussian assumption, but Bartlett's test doesn't seem much better in these particular scenarios. Levene's test, however, does perform better than the F-test, depending on the relative importance you place on Type I and Type II error.

Published at DZone with permission of John Cook, DZone MVB. See the original article here.
