# Dealing with TMI in Statistics


When a result depends on *something*, and we use an *estimator* of that *something* (instead of the true value), then there will be some additional uncertainty.

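For instance, consider a random sample $X_1,\dots,X_n$, i.i.d., from a *Gaussian* distribution $\mathcal{N}(\mu,\sigma^2)$. Then, a confidence interval for the mean is

$$\left[\overline{X} \pm u_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\right]$$

where $u_{1-\alpha/2}$ is a quantile of the standard Gaussian distribution. But the standard deviation (the *something* I was talking about earlier) is usually unknown. So we substitute an estimate of the standard deviation, e.g.

$$\widehat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^n \left(X_i-\overline{X}\right)^2}$$

and the Gaussian quantile has to be replaced by the quantile $t_{n-1,1-\alpha/2}$ of the Student distribution with $n-1$ degrees of freedom:

$$\left[\overline{X} \pm t_{n-1,1-\alpha/2}\,\frac{\widehat{\sigma}}{\sqrt{n}}\right]$$

We call it a *cost* since the new confidence interval is now wider (the Student distribution has higher upper quantiles than the Gaussian distribution).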

So usually, if we substitute an estimate for the true value, there is a price to pay.
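To make that price concrete, here is a minimal simulation sketch in Python (standard library only). It is an illustration under stated assumptions, not code from the original post: the function name `coverage`, the sample size, and the number of replications are all arbitrary choices. It draws many small Gaussian samples and checks how often an interval built with the Gaussian quantile covers the true mean, once with the true standard deviation and once with the estimated one plugged in.

```python
import math
import random
import statistics


def coverage(n, n_sim=20000, alpha=0.05, seed=1):
    """Simulate N(0, 1) samples of size n and return the fraction of
    intervals covering the true mean 0, (a) with the true sigma and
    (b) with the estimated sigma, both using the Gaussian quantile."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)  # Gaussian quantile
    hits_known = 0
    hits_plugin = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(x) / n
        # Sample standard deviation with the usual n - 1 denominator
        s = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
        if abs(xbar) <= z * 1.0 / math.sqrt(n):  # true sigma = 1
            hits_known += 1
        if abs(xbar) <= z * s / math.sqrt(n):    # estimated sigma
            hits_plugin += 1
    return hits_known / n_sim, hits_plugin / n_sim


known, plugin = coverage(n=5)
print(f"coverage with true sigma:      {known:.3f}")
print(f"coverage with estimated sigma: {plugin:.3f}")
```

With such a small sample, the interval that keeps the Gaussian quantile but uses the estimated standard deviation covers the true mean less often than the nominal 95%. Restoring the nominal level requires a larger quantile, which is exactly what the Student distribution provides.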

A few years ago, with Jean David Fermanian and Olivier Scaillet, we were writing a survey on copula density estimation (using kernels). At the end, we wanted to add a short paragraph on the following point: we had assumed that we wanted to fit a copula on a sample $(U_i,V_i)$, i.i.d. with distribution $C$, a copula, but in practice, we start from a sample $(X_i,Y_i)$ with joint distribution $F$ (assumed to have continuous margins and a unique copula $C$). But since margins are usually unknown, there should be a price for not observing them.

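To be more formal, in a perfect world, we would consider the sample $(U_i,V_i)=(F_X(X_i),F_Y(Y_i))$ built from the observations $(X_i,Y_i)$, where $F_X$ and $F_Y$ are the true marginal distributions. But since those are unknown, in practice we consider the pseudo-sample $(\widehat{U}_i,\widehat{V}_i)=(\widehat{F}_X(X_i),\widehat{F}_Y(Y_i))$, where $\widehat{F}_X$ and $\widehat{F}_Y$ are the empirical marginal distributions, i.e. we work with (rescaled) ranks.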

My point is that when I ran simulations for the survey (the idea was more to illustrate several estimation techniques than to prove technical theorems), we observed that the price to pay... was negative! I.e. the variance of the estimator of the density (anywhere on the unit square) was smaller on the pseudo-sample than on the *perfect* sample.

At that time, we could not understand why we got that counter-intuitive result: even if we do know the *true* distribution, it is better not to use it, and to use a nonparametric estimator instead. Our interpretation was based on the concept of discrepancy and was related to the Latin hypercube construction:

With ranks, the data are more regular, and marginal distributions are *exactly* uniform on the unit interval. So there is less variance.

This was our heuristic interpretation.
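The effect can be reproduced with a small experiment. The sketch below is not the simulation from the survey: for simplicity it swaps the kernel density estimator for the empirical copula (the cumulative version), takes the independence copula as the true model, and evaluates at the centre of the unit square. The function names, sample size, and number of replications are illustrative assumptions. It compares the variance of the estimate computed on the perfect sample (true uniform margins) with the one computed on the rank-based pseudo-sample.

```python
import random


def empirical_copula(us, vs, u, v):
    """Proportion of points (us[i], vs[i]) in the rectangle [0, u] x [0, v]."""
    n = len(us)
    return sum(1 for a, b in zip(us, vs) if a <= u and b <= v) / n


def pseudo_obs(xs):
    """Rescaled ranks rank / (n + 1), i.e. the pseudo-observations."""
    n = len(xs)
    order = sorted(range(n), key=xs.__getitem__)
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return [r / (n + 1) for r in ranks]


def variance(values):
    """Unbiased sample variance."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


def compare(n=100, n_sim=2000, u=0.5, v=0.5, seed=1):
    """Variance of the estimate of C(u, v) over n_sim replications, on the
    perfect sample and on the rank-based pseudo-sample (independent margins)."""
    rng = random.Random(seed)
    perfect, pseudo = [], []
    for _ in range(n_sim):
        us = [rng.random() for _ in range(n)]  # margins already uniform
        vs = [rng.random() for _ in range(n)]
        perfect.append(empirical_copula(us, vs, u, v))
        pseudo.append(empirical_copula(pseudo_obs(us), pseudo_obs(vs), u, v))
    return variance(perfect), variance(pseudo)


var_perfect, var_pseudo = compare()
print(f"variance, perfect sample: {var_perfect:.6f}")
print(f"variance, pseudo-sample:  {var_pseudo:.6f}")
```

In this setup the pseudo-sample gives the smaller variance: with ranks, exactly half of the points fall below 0.5 on each margin, so only the way the two margins are paired remains random.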

A couple of weeks ago, Christian Genest and Johan Segers proved that intuition in an article published in JMVA.

Well, we observed something for finite $n$, but Christian and Johan obtained an analytical result. Hence, if we denote


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
