
Data Calls the Model's Bluff


I hear a lot of people saying that simple models work better than complex models when you have enough data. For example, here’s a tweet from Giuseppe Paleologo this morning:

Isn’t it ironic that almost all known results in asymptotic statistics don’t scale well with data?

There are several things people could mean when they say that complex models don’t scale well.

First, they may mean that the implementation of complex models doesn’t scale. The computational effort required to fit the model increases disproportionately with the amount of data.

Second, they could mean that complex models aren’t necessary. A complex model might do even better than a simple model, but simple models work well enough given lots of data.

A third possibility, less charitable than the first two, is that the complex models are a bad fit, and this becomes apparent given enough data. The data calls the model’s bluff. If a statistical model performs poorly with lots of data, it must have performed poorly with a small amount of data too, but you couldn’t tell. It’s simple over-fitting.

I believe that’s what Giuseppe had in mind in his remark above. When I replied that the problem is modeling error, he said “Yes, big time.” The results of asymptotic statistics scale beautifully when the model is correct. But giving a poorly fitting model more data isn’t going to make it perform better.
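A minimal simulation sketch of this point, under assumptions that are mine and not from the discussion above: the data come from a hypothetical process with a hidden quadratic term, and the fitted model is a straight line that ignores it. The names `simulate`, `holdout_error_of_line`, `NOISE_SD`, and `CURVATURE` are illustrative. The holdout error of the wrong model settles above the noise floor no matter how much data it gets; what shrinks with more data is the uncertainty around that error, which is why the misfit only becomes visible at large n.

```python
import numpy as np

NOISE_SD = 1.0    # assumed noise level; the irreducible error is NOISE_SD**2
CURVATURE = 1.0   # assumed hidden quadratic term that the fitted line ignores


def simulate(n, rng):
    """Draw n points from a hypothetical process: linear trend plus curvature."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + CURVATURE * x**2 + rng.normal(0.0, NOISE_SD, size=n)
    return x, y


def holdout_error_of_line(n, rng):
    """Fit a straight line on n points, then score it on n fresh points.

    Returns the holdout mean squared error and its standard error.
    """
    x_train, y_train = simulate(n, rng)
    x_test, y_test = simulate(n, rng)
    slope, intercept = np.polyfit(x_train, y_train, 1)  # highest power first
    sq_err = (y_test - (intercept + slope * x_test)) ** 2
    return sq_err.mean(), sq_err.std(ddof=1) / np.sqrt(n)


rng = np.random.default_rng(0)
results = {}
for n in (30, 300, 3_000, 300_000):
    mse, se = holdout_error_of_line(n, rng)
    results[n] = (mse, se)
    print(f"n={n:>7}  holdout MSE = {mse:.3f} +/- {se:.3f}  "
          f"(noise floor = {NOISE_SD**2:.3f})")
```

In a simulation the noise floor is known by construction; with real data it is not, which is part of why a misfit is so easy to miss on a small sample.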

The wrong conclusion would be to say that complex models work well for small data. I think the conclusion is that you can’t tell that complex models are not working well with small data. It’s a researcher’s paradise. You can fit a sequence of ever more complex models, getting a publication out of each. Evaluate your model using simulations based on your assumptions and you can avoid the accountability of the real world.

If the robustness of simple models is important with huge data sets, it’s even more important with small data sets.

If anything, model complexity should increase with data, not decrease. I don’t mean that it must increase, only that it can: with more data, you have the ability to test the fit of more complex models. When people say that simple models scale better, they may mean that they haven’t been able to do better, and that the data has exposed the problems with the other things they’ve tried.
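One way to picture "the ability to test the fit of more complex models" is a paired holdout comparison. The sketch below is only an illustration under my own assumptions: a hypothetical data-generating process with a small quadratic term, polynomial fits of degree 1 and 2 standing in for "simple" and "complex", and the illustrative names `simulate` and `compare_models`. It reports how much the more complex model reduces squared error on fresh data, and how large that reduction is relative to its standard error.

```python
import numpy as np


def simulate(n, rng):
    """Hypothetical process: a line plus a small quadratic term, unit noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 1.0, size=n)
    return x, y


def compare_models(n, rng, simple_deg=1, complex_deg=2):
    """Paired holdout comparison of two polynomial fits.

    Returns the mean reduction in squared error from using the more complex
    model, and that reduction divided by its standard error (a z-score).
    """
    x_train, y_train = simulate(n, rng)
    x_test, y_test = simulate(n, rng)
    err = {}
    for deg in (simple_deg, complex_deg):
        coeffs = np.polyfit(x_train, y_train, deg)
        err[deg] = (y_test - np.polyval(coeffs, x_test)) ** 2
    gain = err[simple_deg] - err[complex_deg]  # positive when complex wins
    z = gain.mean() / (gain.std(ddof=1) / np.sqrt(n))
    return gain.mean(), z


rng = np.random.default_rng(1)
summary = {}
for n in (30, 300, 3_000, 200_000):
    gain, z = compare_models(n, rng)
    summary[n] = (gain, z)
    print(f"n={n:>7}  mean gain from quadratic term = {gain:+.4f}  z = {z:+.1f}")
```

With this setup the true gain is small, so at small n the z-score is dominated by sampling noise and can land on either side of zero; only at large n does the comparison become decisive.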



Published at DZone with permission of John Cook, DZone MVB. See the original article here.


