I hear a lot of people saying that simple models work better than complex models when you have enough data. For example, here’s a tweet from Giuseppe Paleologo this morning:
There are several things people could mean when they say that complex models don’t scale well.
First, they may mean that the implementation of complex models doesn’t scale. The computational effort required to fit the model increases disproportionately with the amount of data.
Second, they could mean that complex models aren’t necessary. A complex model might do even better than a simple model, but simple models work well enough given lots of data.
A third possibility, less charitable than the first two, is that the complex models are a bad fit, and this becomes apparent given enough data. The data calls the model’s bluff. If a statistical model performs poorly with lots of data, it must have performed poorly with a small amount of data too, but you couldn’t tell. It’s simple over-fitting.
I believe that’s what Giuseppe had in mind in his remark above. When I replied that the problem is modeling error, he said “Yes, big time.” The results of asymptotic statistics scale beautifully when the model is correct. But giving a poorly fitting model more data isn’t going to make it perform better.
The wrong conclusion would be to say that complex models work well for small data. I think the conclusion is that you can’t tell that complex models are not working well with small data. It’s a researcher’s paradise. You can fit a sequence of ever more complex models, getting a publication out of each. Evaluate your model using simulations based on your assumptions and you can avoid the accountability of the real world.
If the robustness of simple models is important with huge data sets, it’s even more important with small data sets.
Model complexity should increase with data, not decrease. I don’t mean that it should necessarily increase, but that it could. With more data, you have the ability to test the fit of more complex models. When people say that simple models scale better, they may mean that they haven’t been able to do better, that the data has exposed the problems with other things they’ve tried.