If you train a model on a set of data, it should fit that data well. The hope, however, is that it will fit a new set of data well. So in machine learning and statistics, people split their data into two parts. They train the model on one half, and see how well it fits on the other half. This is called cross validation, and it helps prevent over-fitting, fitting a model too closely to the peculiarities of a data set.
For example, suppose you have measured the value of a function at 100 points. Unbeknownst to you, the data come from a cubic polynomial plus some noise. You can fit these 100 points exactly with a 99th degree polynomial, but this gives you the illusion that you’ve learned more than you really have. But if you divide your data into test and training sets of 50 points each, overfitting on the training set will result in a terrible fit on the test set. If you fit a cubic polynomial to the training data, you should do well on the test set. If you fit a 49th degree polynomial to the training data, you’ll fit it perfectly, but do a horrible job with the test data.
Now suppose we have two kinds of models to fit. We train each on the training set, and pick the one that does better on the test set. We’re not over-fitting because we haven’t used the test data to fit our model. Except we really are: we used the test set to select a model, though we didn’t use the test set to fit the parameters in the two models. Think of a larger model as a tree. The top of the tree tells you which model to pick, and under that are the parameters for each model. When we think of this new hierarchical model as “the model,” then we’ve used our test data to fit part of the model, namely to fit the bit at the top.
With only two models under consideration, this isn’t much of a problem. But if you have a machine learning package that tries millions of models, you can be over-fitting in a subtle way, and this can give you more confidence in your final result than is warranted.
The distinction between parameters and models is fuzzy. That’s why “Bayesian model averaging” is ultimately just Bayesian modeling. You could think of the model selection as simply another parameter. Or you could go the other way around and think of of each parameter value as an index for a family of models. So if you say you’re only using the test data to select models, not parameters, you could be fooling yourself.
For example, suppose you want to fit a linear regression to a data set. That is, you want to pick m and b so that y = mx + b is a good fit to the data. But now I tell you that you are only allowed to fit models with one degree of freedom. You’re allowed to do cross validation, but you’re only allowed to use the test data for model selection, not model fitting.
So here’s what you could do. Pick a constant value of b, call it b0. Now fit the one-parameter model y = mx + b0 on your training data, selecting the value of m only to minimize the error in fitting the training set. Now pick another value of b, call it b1, and see how well it does on the test set. Repeat until you’ve found the best value of b. You’ve essentially used the training and test data to fit a two-parameter model, albeit awkwardly.