Better Predictions With Big Data

Things have been very unpredictable lately, to say the least. How can we fix this?

Our predictive capabilities have taken a bit of a battering in recent times. Numerous political polls called events ranging from Brexit to the Trump election massively wrong, and senior political figures have cast doubt on the ability of ‘experts’ as a result.

Researchers from Columbia, Harvard and Princeton, however, have recently devised a method that they believe will help us make more accurate predictions in areas from healthcare to politics.

The approach, documented in a recently published paper, builds on previous work by the team that highlighted how certain variables, whilst appearing significant, are not particularly useful for making predictions, whilst others that appear insignificant can be very important.

Finding the Key Variables

These early studies raised the question of just what makes a variable useful when forming predictions. Traditional methods have tried to assign significance to each variable before putting it into a model.

To provide a more robust approach, the researchers propose a new metric known as the influence score (I-score), which looks solely at the ability of a variable to predict outcomes. When tested, it proved reliable in distinguishing between noisy and predictive variables, improving prediction rates quite significantly. Indeed, in one test the prediction rates for breast cancer leapt from 70% to 92%. The researchers are confident the approach can be applied to various fields with similar outcomes.

“The practical implications are what drove the project, so they’re quite broad,” they say. “Essentially anytime you might be interested in predicting and identifying highly predictive variables, you might have something to gain by conducting variable selection through a statistic like the I-score, which is related to variable predictivity. That the I-score fares especially well in high dimensional data and with many complex interactions between variables is an extra boon for the researcher or policy expert interested in predicting something with large dimensional data.”
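To make the idea of scoring variables by predictivity a little more concrete, here is a minimal Python sketch. The article does not give the formula, so the form used here is an assumption: the observations are partitioned into cells by the joint values of the chosen discrete variables, and the score sums the squared cell size times the squared gap between the cell's mean outcome and the overall mean, divided by the sample size times the outcome variance. The function name `i_score` and the toy data (an outcome that depends on `x1` and `x2` only jointly, plus a pure noise variable `x3`) are hypothetical and purely illustrative.

```python
from collections import defaultdict
from itertools import product


def i_score(rows, outcomes, variables):
    """Influence-style score for a subset of discrete variables.

    Assumed form (not taken from the article): the sum over cells of
    n_j^2 * (cell_mean - overall_mean)^2, divided by n * variance.
    """
    n = len(outcomes)
    overall_mean = sum(outcomes) / n
    variance = sum((y - overall_mean) ** 2 for y in outcomes) / n
    if variance == 0:
        return 0.0  # a constant outcome cannot be predicted better or worse

    # Partition the observations into cells keyed by the joint values
    # of the selected variables.
    cells = defaultdict(list)
    for row, y in zip(rows, outcomes):
        cells[tuple(row[v] for v in variables)].append(y)

    total = 0.0
    for ys in cells.values():
        n_j = len(ys)
        cell_mean = sum(ys) / n_j
        total += n_j ** 2 * (cell_mean - overall_mean) ** 2
    return total / (n * variance)


# Hypothetical toy data: the outcome is x1 XOR x2, and x3 is pure noise.
rows = [
    {"x1": a, "x2": b, "x3": c} for a, b, c in product([0, 1], repeat=3)
] * 5
outcomes = [r["x1"] ^ r["x2"] for r in rows]

print(i_score(rows, outcomes, ["x1"]))              # 0.0
print(i_score(rows, outcomes, ["x3"]))              # 0.0
print(i_score(rows, outcomes, ["x1", "x2"]))        # 10.0
print(i_score(rows, outcomes, ["x1", "x2", "x3"]))  # 5.0
```

On this toy data, `x1` looks useless on its own, yet the pair `x1`, `x2` scores highest, and adding the noise variable `x3` lowers the score rather than raising it. That mirrors the article's point that variables which appear insignificant individually can still matter for prediction.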

Would it make us any better at predicting election results? Time will tell, I suppose.

Published at DZone with permission of
