When Big Data Uncovers No Data
When Big Data uncovers No Data – “not there”, “null”, or simply “nothing worth using”…
I’ve been researching Big Data at MWD for the last couple of months now. In my introductory report* I stuck to the script and focused on technologies for managing Big Data, because that’s commonly what people have the most pressing need to understand… but where there’s Big Data, there’s also often “No Data”.
You may not have been aware of No Data before your Big Data initiative highlighted it (and you may still be unaware!). In some cases, understanding the implications of No Data adds insight to the complex analysis of Big Data, because it can tell you why something isn’t happening, or at the very least point out the gaps in your coverage of what is happening so that you can qualify your results.
How much this affects you depends on what data you’re missing, and what you’d do with it if you had it. It’s always best to start with a firm view of the business decisions, goals and opportunities you’re trying to support with a Big Data initiative, then track back to the data sources that will underpin these efforts. From there, determine whether those sources are known or unknown to you; where you might encounter No Data; and, if you do, what sort of “no” it represents (and whether it even matters: it won’t always!). If data can be said to have ‘gravity’, in that it doesn’t like being slung around because of the latency costs, then “No Data” can also be thought of as exerting a gravitational effect on your results (much as a distant, dark exoplanet tugs on its star). Whether that effect is appreciable depends on your use case and tolerance for error.
When it comes to the types of No Data you may encounter, sometimes data is simply “Not available”: from the perspective of visibility for processing, it doesn’t exist. That could be because it’s still too hard or too costly to reach, or because it hasn’t been collected or sampled in the first place.
Sometimes there’s No Data because what you have is a “Null data value”: there should be something there but, for whatever reason, there’s a NULL in its place. This might be because something isn’t applicable, in which case “NULL” is a perfectly acceptable value to have (and an expected one, so it can be flagged come analysis time). But it may be that “NULL” was just a default that hasn’t been overwritten, and you’re really expecting something else to be there. No account will have been taken of its absence during processing, which makes that data ‘dirty’.
Sometimes, of course, the data’s dirty for other reasons (noise, error, fraud, etc.), and so effectively what you have is “Nothing worth using”.
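To make the three types concrete, here is a minimal Python sketch that sorts a single field of a record into one of them. Everything in it is an assumption made for illustration: the `NoData` categories, the `classify` function, the set of fields where a null is legitimate, and the plausibility checks are hypothetical, and what counts as “not worth using” will depend entirely on your own data.

```python
from enum import Enum


class NoData(Enum):
    """Hypothetical labels for the kinds of No Data described above."""
    NOT_AVAILABLE = "not available"
    EXPECTED_NULL = "null, and legitimately so"
    UNEXPECTED_NULL = "null, but a value was expected"
    NOT_WORTH_USING = "nothing worth using"


def classify(record, field, nullable_fields, plausibility_checks):
    """Return the NoData category for `field`, or None if the value looks usable.

    record              -- dict of field name -> value
    nullable_fields     -- set of fields where None is an acceptable value
    plausibility_checks -- dict of field name -> function returning True
                           when a value is believable (an assumed stand-in
                           for real noise/error/fraud detection)
    """
    if field not in record:
        # The field was never collected or is not visible for processing.
        return NoData.NOT_AVAILABLE
    value = record[field]
    if value is None:
        if field in nullable_fields:
            return NoData.EXPECTED_NULL
        return NoData.UNEXPECTED_NULL
    check = plausibility_checks.get(field)
    if check is not None and not check(value):
        return NoData.NOT_WORTH_USING
    return None


# Illustrative rules for a made-up customer record.
NULLABLE = {"cancel_date"}
CHECKS = {"age": lambda v: 0 <= v <= 120}

customer = {"age": 430, "cancel_date": None, "postcode": None}
for name in ("age", "cancel_date", "postcode", "income"):
    print(name, "->", classify(customer, name, NULLABLE, CHECKS))
```

Separating the expected null from the unexpected one is the point of the exercise: the first can be flagged and carried into analysis, while the second is a data quality problem.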
What you do about No Data depends on the type of ‘no’. “Not available” may arise because of a curation issue (e.g. there’s ‘dark data’ that hasn’t seen the light of day for quite some time, which could be moved from long-term storage onto a Hadoop cluster to improve its visibility) or a collection issue (which could be addressed by, for example, putting in place measures to sample new real-time streams, or by instrumenting products and services to generate data where previously there was none).
To address issues of “Null Data” you’ll need to invoke validation measures that improve data quality, so that only the ‘right sort of nulls’ are captured (and correctly interpreted).
Lastly, if you think you’ve uncovered pockets of “Nothing worth using”, it’s time to consider the veracity of your data source, and from there you’ll be straying into the whole “Clean Data vs More Data” debate!
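As one possible illustration of a validation measure for nulls, the Python sketch below checks incoming records against a small schema that states, field by field, whether a null is a legitimate value. The schema, its field names, and the `null_problems` and `partition` helpers are all hypothetical; a real pipeline would more likely lean on whatever validation its ingestion tooling already provides.

```python
# Hypothetical schema: field name -> whether a null is a legitimate value.
SCHEMA = {
    "customer_id": False,
    "signup_date": False,
    "cancel_date": True,  # a null here means "has not cancelled"
}


def null_problems(record, schema=SCHEMA):
    """List the fields in `record` that are absent or hold the wrong sort of null."""
    problems = []
    for field, nullable in schema.items():
        if field not in record:
            problems.append(f"{field}: missing")
        elif record[field] is None and not nullable:
            problems.append(f"{field}: unexpected null")
    return problems


def partition(records, schema=SCHEMA):
    """Split records into those safe to capture and those needing attention."""
    accepted, rejected = [], []
    for record in records:
        problems = null_problems(record, schema)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


incoming = [
    {"customer_id": 1, "signup_date": "2013-01-05", "cancel_date": None},
    {"customer_id": 2, "signup_date": None, "cancel_date": None},
    {"customer_id": 3, "cancel_date": "2013-03-01"},
]
accepted, rejected = partition(incoming)
print(len(accepted), "accepted;", len(rejected), "rejected")
for record, problems in rejected:
    print(record["customer_id"], problems)
```

Rejected records are kept alongside their reasons rather than silently dropped, so you can still see where No Data is arising and decide whether it matters.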
In fact, your stance on clean data vs more data (which will itself be shaped by your end goal) may dictate whether you do anything about No Data at all. You may well be happy that you have a ‘good enough’ answer with the data you do have, for instance. However, if you are at all serious about No Data’s implications, then the search for it needs to be factored into the data preparation phase of any Big Data analytics initiative you undertake. It’s here that you should draw up your data wish list, then determine which sources it makes real business sense to seek out, how much effort it’s worth expending to do so, and what to infer if those wishes can’t be fulfilled.
With that level of preparation under your belt you can at least be sanguine about what data you can’t turn up, and cognizant of the effect its absence might have on the insight born of the data you can.