Prediction Competitions

Competitions have a long history in forecasting and prediction, and have been instrumental in focusing research attention on methods that work well in practice. In the forecasting community, the M competition and M3 competition have been particularly influential. The data mining community has the annual KDD Cup, which has generated attention on a wide range of prediction problems and associated methods. Recent KDD Cups have been hosted on Kaggle.

In my research group meeting today, we discussed our (limited) experiences in competing in some Kaggle competitions, and we reviewed the following two papers, which describe two prediction competitions:

  1. Athanasopoulos and Hyndman (IJF 2011). The value of feedback in forecasting competitions. [preprint version]
  2. Roy et al (2013). The Microsoft Academic Search Dataset and KDD Cup 2013.

Some points of discussion:

  • The old style of competition, where participants make a single submission and the results are compiled by the organizers, is much less effective than competitions involving feedback and a leaderboard (such as those hosted on Kaggle). The feedback seems to encourage participants to do better, and the results often improve substantially during the competition.
  • Too many submissions result in overfitting to the test data. Therefore, the final scores need to be based on a different test data set from the one used to score submissions during the competition. Kaggle does not do this, although they partially address the problem by computing the leaderboard scores on a subset of the final test set.
  • The metric used in the competition is important, and competition organizers sometimes do not think it through carefully enough.
  • There are several competition platforms available now, including Kaggle, CrowdANALYTIX and TunedIT.
  • The best competitions are focused on specific domains and problems. For example, the GEFCom 2014 competitions are about specific problems in energy forecasting.
  • Competitions are great for advancing knowledge of what works, but they do not lead to data scientists being well paid, as many people compete but few are rewarded.
  • The IJF likes to publish papers from winners of prediction competitions because of the extensive empirical evaluation provided by the competition. However, a condition of publication is that the code and methods are fully revealed, and winners are not always happy to comply.
  • The IJF will only publish competition results if they present new information about prediction methods, or tackle new prediction problems, or measure predictive accuracy in new ways. Just running another competition like the previous ones is not enough. It still has to involve genuine research results.
  • I would love to see some serious research about prediction competitions, but that would probably require a company like Kaggle to make their data public. See Frank Diebold's comments on this too.
  • A nice side effect of some competitions is that they create a benchmark data set with well-tested benchmark methods. This has worked well for the M3 data, for example, and new time series forecasting algorithms can be easily tested against these published results. However, over-study of a single benchmark data set means that methods are probably overfitting to the published test data. Therefore, a wider range of benchmarks is desirable.
  • Prediction competitions are a fun way to hone your skills in forecasting and prediction, and every student in this field is encouraged to compete in a few competitions. I can guarantee you will learn a great deal about the challenges of predicting real data, something you don't always learn in classes or via textbooks.
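
The point above about too many submissions overfitting to the test data can be illustrated with a small simulation. The Python sketch below is only an illustration under stated assumptions, not a description of how any platform actually scores entries: the function name `simulate_leaderboard`, its parameters, and the choice of accuracy as the metric are all hypothetical. Every "submission" is a set of coin-flip predictions with no skill at all, the leaderboard is scored on a small subset of the test set, and the final score is computed on the remaining, disjoint part.

```python
import random


def simulate_leaderboard(n_test=2000, public_fraction=0.1,
                         n_submissions=500, seed=0):
    """Return (leaderboard_score, held_out_score) for the submission
    that tops the leaderboard. All names here are illustrative."""
    rng = random.Random(seed)
    # Hidden true labels for a binary prediction problem.
    labels = [rng.getrandbits(1) for _ in range(n_test)]
    # The first n_public cases feed the leaderboard; the rest are held out.
    n_public = int(n_test * public_fraction)

    def accuracy(preds, start, stop):
        hits = sum(p == y for p, y in zip(preds[start:stop], labels[start:stop]))
        return hits / (stop - start)

    best_public, best_private = -1.0, -1.0
    for _ in range(n_submissions):
        # A submission with no predictive skill: pure coin flips.
        preds = [rng.getrandbits(1) for _ in range(n_test)]
        public = accuracy(preds, 0, n_public)
        if public > best_public:
            # Keep whichever submission looks best on the leaderboard.
            best_public = public
            best_private = accuracy(preds, n_public, n_test)
    return best_public, best_private


if __name__ == "__main__":
    public, private = simulate_leaderboard()
    print(f"leaderboard accuracy of the 'winner': {public:.3f}")
    print(f"accuracy on the held-out test data:   {private:.3f}")
```

Because the "winner" is chosen by its leaderboard score, that score sits well above 50% even though no submission has any skill, while the score on the held-out data stays close to chance. This is why final rankings are more trustworthy when they rest on data that was never used for feedback during the competition.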

Published at DZone with permission of
