
Why Every Statistician Should Know About Cross-Validation

Surprisingly, many statisticians see cross-validation as something data miners do, but not a core statistical technique. I thought it might be helpful to summarize the role of cross-validation in statistics.

Cross-validation is primarily a way of measuring the predictive performance of a statistical model. Every statistician knows that model fit statistics are not a good guide to how well a model will predict: a high R^2 does not necessarily mean a good model. It is easy to over-fit the data by including too many degrees of freedom and so inflate R^2 and other fit statistics. For example, in a simple polynomial regression I can just keep adding higher order terms and so get better and better fits to the data. But the predictions from the model on new data will usually get worse as higher order terms are added.

One way to measure the predictive ability of a model is to test it on a set of data not used in estimation. Data miners call this a "test set" and the data used for estimation the "training set". For example, the predictive accuracy of a model can be measured by the mean squared error (MSE) on the test set. This will generally be larger than the MSE on the training set because the test data were not used for estimation.
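To make this concrete, here is a minimal sketch in Python (not from the original post; the data, the simple 50/50 split and the polynomial degrees are invented for illustration). As the degree increases, the training-set MSE keeps improving, while the test-set MSE typically starts to deteriorate once the model over-fits.

    # Sketch: training vs test MSE for polynomial fits of increasing degree.
    # All data below are simulated purely for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 60
    x = np.sort(rng.uniform(-3, 3, n))
    y = np.sin(x) + rng.normal(0, 0.3, n)          # true signal plus noise

    train = np.arange(0, n, 2)                     # a simple 50/50 split
    test = np.arange(1, n, 2)

    for degree in (1, 3, 6, 12):
        coefs = np.polyfit(x[train], y[train], degree)   # fit on the training set only
        mse_train = np.mean((y[train] - np.polyval(coefs, x[train])) ** 2)
        mse_test = np.mean((y[test] - np.polyval(coefs, x[test])) ** 2)
        print(f"degree {degree:2d}: training MSE {mse_train:.3f}, test MSE {mse_test:.3f}")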

However, there is often not enough data to allow some of it to be kept back for testing. A more sophisticated version of training/test sets is leave-one-out cross-validation (LOOCV), in which the accuracy measures are obtained as follows. Suppose there are n independent observations, y_1,\dots,y_n.

  1. Let observation i form the test set, and fit the model using the remaining data. Then compute the error (e_{i}^*=y_{i}-\hat{y}_{i}) for the omitted observation. This is sometimes called a "predicted residual" to distinguish it from an ordinary residual.
  2. Repeat step 1 for i=1,\dots,n.
  3. Compute the MSE from e_{1}^*,\dots,e_{n}^*. We shall call this the CV.

This is a much more efficient use of the available data, as you only omit one observation at each step. However, it can be very time consuming to implement (except for linear models; see below).
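As an illustration of steps 1-3 above, here is a brute-force LOOCV sketch in Python (a made-up dataset and an ordinary least-squares model fitted with numpy; the post itself does not prescribe any particular implementation).

    # Sketch: leave-one-out cross-validation by refitting the model n times.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 40
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + 2 covariates
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, n)   # simulated response

    pred_errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                                 # leave observation i out
        beta_hat, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        pred_errors[i] = y[i] - X[i] @ beta_hat                  # the "predicted residual" e_i^*

    cv = np.mean(pred_errors ** 2)                               # the CV statistic (an MSE)
    print(f"LOOCV estimate of the MSE on new data: {cv:.4f}")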

Other statistics (e.g., the mean absolute error, MAE) can be computed similarly. A related measure is the PRESS statistic (predicted residual sum of squares), equal to n\times\text{MSE}.

Variations on cross-validation include leave-k-out cross-validation (in which k observations are left out at each step) and k-fold cross-validation (where the original sample is randomly partitioned into k subsamples and one is left out in each iteration). Another popular variant is the .632+ bootstrap of Efron & Tibshirani (1997), which has better properties but is more complicated to implement.
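For comparison, here is a minimal k-fold sketch under the same kind of simulated setup; the choice k = 5 is arbitrary and not something the post specifies.

    # Sketch: k-fold cross-validation with a random partition into k folds.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 40, 5
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.5, n)

    folds = rng.permutation(n) % k                 # random, roughly equal fold assignment
    sq_errors = []
    for fold in range(k):
        test = folds == fold                       # this fold is left out
        beta_hat, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        sq_errors.extend((y[test] - X[test] @ beta_hat) ** 2)

    print(f"{k}-fold CV estimate of the MSE: {np.mean(sq_errors):.4f}")

Unlike LOOCV, the result depends on the random partition, so a different fold assignment can change the estimate slightly.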

Minimizing a CV statistic is a useful way to do model selection, such as choosing variables in a regression or choosing the degrees of freedom of a nonparametric smoother. It is certainly far better than procedures based on statistical tests and provides a nearly unbiased measure of the true MSE on new observations.

However, as with any variable selection procedure, it can be misused. Beware of looking at statistical tests after selecting variables using cross-validation: the tests do not take account of the variable selection that has taken place and so the p-values can mislead.

It is also important to realise that it doesn't always work. For example, if there are exact duplicate observations (i.e., two or more observations with equal values for all covariates and for the y variable) then leaving one observation out will not be effective.

Another problem is that a small change in the data can cause a large change in the model selected. Many authors have found that k-fold cross-validation works better in this respect.

In a famous paper, Shao (1993) showed that leave-one-out cross-validation does not lead to a consistent estimate of the model. That is, if there is a true model, then LOOCV will not always find it, even with very large sample sizes. In contrast, certain kinds of leave-k-out cross-validation, where k increases with n, will be consistent. Frankly, I don't consider this a very important result as there is never a true model. In reality, every model is wrong, so consistency is not really an interesting property.

Cross-validation for linear models

While cross-validation can be computationally expensive in general, it is very easy and fast to compute LOOCV for linear models. A linear model can be written as

    \[ \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}. \]

Then

    \[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} \]

and the fitted values can be calculated using

    \[ \mathbf{\hat{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y} = \mathbf{H}\mathbf{Y}, \]

where \mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' is known as the "hat-matrix" because it is used to compute \mathbf{\hat{Y}} ("Y-hat").

If the diagonal values of \mathbf{H} are denoted by h_{1},\dots,h_{n}, then the cross-validation statistic can be computed using

    \[ \text{CV} = \frac{1}{n}\sum_{i=1}^n [e_{i}/(1-h_{i})]^2, \]

where e_{i} is the residual obtained from fitting the model to all n observations. See Christensen's book Plane Answers to Complex Questions for a proof. Thus, it is not necessary to actually fit n separate models when computing the CV statistic for linear models. This remarkable result allows cross-validation to be used while only fitting the model once to all available observations.
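The identity is easy to check numerically. The sketch below (simulated data again) computes the CV statistic once from the hat-matrix diagonal and once by the brute-force leave-one-out loop; the two values should agree up to floating-point error.

    # Sketch: LOOCV for a linear model from a single fit, via the hat matrix.
    import numpy as np

    rng = np.random.default_rng(2)
    n = 50
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    y = X @ np.array([0.5, 1.0, -2.0, 0.0]) + rng.normal(0, 1.0, n)

    H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix H = X (X'X)^{-1} X'
    e = y - H @ y                                  # ordinary residuals from the full fit
    h = np.diag(H)                                 # leverages h_1, ..., h_n
    cv_shortcut = np.mean((e / (1 - h)) ** 2)

    errs = [y[i] - X[i] @ np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i), rcond=None)[0]
            for i in range(n)]                     # brute-force leave-one-out for comparison
    cv_naive = np.mean(np.square(errs))

    print(cv_shortcut, cv_naive)                   # identical up to rounding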

Relationships with other quantities

Cross-validation statistics and related quantities are widely used in statistics, although it has not always been clear that these are all connected with cross-validation.

Jackknife

A jackknife estimator is obtained by recomputing an estimate leaving out one observation at a time from the estimation sample. The n estimates allow the bias and variance of the statistic to be calculated.
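A minimal jackknife sketch (the data and the statistic, here the log of a sample mean, are invented for illustration):

    # Sketch: jackknife bias and variance estimates for a simple statistic.
    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(2.0, size=30)
    n = len(x)
    theta_hat = np.log(x.mean())                   # full-sample estimate

    theta_loo = np.array([np.log(np.delete(x, i).mean()) for i in range(n)])
    theta_bar = theta_loo.mean()

    bias = (n - 1) * (theta_bar - theta_hat)                       # jackknife bias estimate
    variance = (n - 1) / n * np.sum((theta_loo - theta_bar) ** 2)  # jackknife variance estimate
    print(f"estimate {theta_hat:.3f}, bias {bias:.4f}, variance {variance:.4f}")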

Akaike's Information Criterion

Akaike's Information Criterion is defined as

    \[ \text{AIC} = -2\log {\cal L}+ 2p, \]

where {\cal L} is the maximized likelihood using all available data for estimation and p is the number of free parameters in the model. Asymptotically, minimizing the AIC is equivalent to minimizing the CV value. This is true for any model (Stone 1977), not just linear models. It is this property that makes the AIC so useful in model selection when the purpose is prediction.
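As a sketch of how the two quantities can be computed side by side (simulated data; the Gaussian log-likelihood is written with the error variance estimated by RSS/n, and p counts the regression coefficients plus the variance parameter, one common convention):

    # Sketch: AIC and the hat-matrix CV statistic for nested linear models.
    import numpy as np

    rng = np.random.default_rng(4)
    n = 80
    Z = rng.normal(size=(n, 4))
    y = 1.0 + 2.0 * Z[:, 0] - 1.0 * Z[:, 1] + rng.normal(0, 1.0, n)   # only 2 covariates matter

    for q in range(1, 5):                          # models using the first q covariates
        X = np.column_stack([np.ones(n), Z[:, :q]])
        H = X @ np.linalg.solve(X.T @ X, X.T)
        e = y - H @ y
        rss = np.sum(e ** 2)
        p = X.shape[1] + 1                         # coefficients plus the error variance
        aic = n * np.log(2 * np.pi * rss / n) + n + 2 * p   # -2 log L + 2p for a Gaussian model
        cv = np.mean((e / (1 - np.diag(H))) ** 2)
        print(f"q = {q}: AIC {aic:8.2f}, CV {cv:.4f}")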

Schwarz Bayesian Information Criterion

A related measure is Schwarz's Bayesian Information Criterion:

    \[ \text{BIC} = -2\log {\cal L}+ p\log(n), \]

where n is the number of observations used for estimation. Because of the heavier penalty, the model chosen by BIC is either the same as that chosen by AIC, or one with fewer terms. Asymptotically, for linear models, minimizing BIC is equivalent to leave-v-out cross-validation when v = n[1-1/(\log(n)-1)] (Shao 1997).
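As a rough worked example (not from the original post): with n = 100,

    \[ v = 100\left[1 - \frac{1}{\log(100) - 1}\right] \approx 100\left[1 - \frac{1}{3.61}\right] \approx 72, \]

so the equivalent cross-validation leaves out roughly 72 of the 100 observations at each step, far more than the single observation omitted by LOOCV.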

Many statisticians like to use BIC because it is consistent: if there is a true underlying model, then with enough data the BIC will select that model. However, in reality there is rarely, if ever, a true underlying model, and even if there were a true underlying model, selecting that model will not necessarily give the best forecasts (because the parameter estimates may not be accurate).

Cross-validation for time series

When the data are not independent, cross-validation becomes more difficult, as leaving out an observation does not remove all the associated information due to the correlations with other observations. For time series forecasting, a cross-validation statistic is obtained as follows (a short sketch appears after the steps):

  1. Fit the model to the data y_1,\dots,y_t and let \hat{y}_{t+1} denote the forecast of the next observation. Then compute the error (e_{t+1}^*=y_{t+1}-\hat{y}_{t+1}) for the forecast observation.
  2. Repeat step 1 for t=m,\dots,n-1, where m is the minimum number of observations needed for fitting the model.
  3. Compute the MSE from e_{m+1}^*,\dots,e_{n}^*.
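A minimal sketch of this rolling one-step-ahead procedure (a simulated series, and a deliberately simple "model", a linear trend fitted by least squares, chosen only to keep the example short):

    # Sketch: one-step-ahead time series cross-validation with an expanding window.
    import numpy as np

    rng = np.random.default_rng(5)
    n, m = 60, 20                                  # series length, minimum training size
    time = np.arange(n)
    y = 0.5 * time + rng.normal(0, 2.0, n)         # simulated trend plus noise

    errors = []
    for end in range(m, n):                        # fit on y[0:end], forecast y[end]
        T = np.column_stack([np.ones(end), np.arange(end)])
        beta_hat, *_ = np.linalg.lstsq(T, y[:end], rcond=None)
        forecast = np.array([1.0, end]) @ beta_hat # one-step-ahead forecast
        errors.append(y[end] - forecast)

    print(f"time series CV (MSE of one-step forecasts): {np.mean(np.square(errors)):.3f}")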

References

An excellent and comprehensive recent survey of cross-validation results is Arlot and Celisse (2010).
