Clean Data vs More Data
Are you happy with your wash? Or are you deeply troubled by ‘dirty data’? Fear not… your real problem might simply be that you don’t have enough of it! A Big Data outfit might still be for you, as long as you’re careful what occasions you wear it for, and adhere to the washing instructions on the label.
Before I dove deep into Big Data for MWD, I was all set to follow my instincts on data cleanliness and subscribe to the traditional “Big Data with Bad Data is a Bad Idea” ethos… but it seems there’s a new paradigm on the block, and things aren’t as clear cut as you might have imagined.
The answer to bad data? Get more data! Well, as long as you’re completely clear on why you’re doing it and who you’re doing it for, anyway… best not to get carried away, eh? The idea with the ‘more data’ approach is that, rather than expend significant resources on cleaning and consolidating data prior to analysis, you should instead simply hoard more of it (i.e. the ‘data lake’ approach). The premise here is that a ‘good enough’ signal will eventually rise up from the noise (or, in the case of ‘bad data’ because of ‘no data’… from the silence – but that’s another story!).
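To make the ‘signal rises from the noise’ premise a little more concrete, here is a minimal sketch in Python. It is purely illustrative: the function names, the true value of 100 and the noise levels are all assumptions made for the example, not anything taken from a real data lake. It estimates a value by averaging noisy records, and suggests two things: random noise tends to wash out as the number of records grows, while a systematic bias does not, however much data you hoard.

```python
import random
import statistics


def estimate_error(n_records, noise_sd, true_value=100.0, bias=0.0, rng=None):
    """Absolute error of the plain average over n noisy records.

    Hypothetical model: each record = true value + fixed bias + random noise.
    """
    rng = rng or random.Random()
    readings = [
        true_value + bias + rng.gauss(0.0, noise_sd) for _ in range(n_records)
    ]
    return abs(statistics.fmean(readings) - true_value)


def mean_error(n_records, noise_sd, bias=0.0, trials=200, seed=0):
    """Average the estimation error over many repeated trials."""
    rng = random.Random(seed)
    return statistics.fmean(
        estimate_error(n_records, noise_sd, bias=bias, rng=rng)
        for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (10, 100, 1000):
        unbiased = mean_error(n, noise_sd=5.0)
        biased = mean_error(n, noise_sd=5.0, bias=2.0)
        print(f"{n:>5} records: noise only {unbiased:.2f}, with bias {biased:.2f}")
```

In this toy model the ‘noise only’ error shrinks as records are added, whereas the biased source settles near the size of its bias. That is the caveat behind ‘get more data’: it helps with random dirt, not with dirt that always leans the same way.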
Beware, though, that a ‘yet more dirty data’ approach may stick in the craw of DBAs, data stewards, data architects, etc. with views similar to my own original default position. What we’re saying now is “well, it might be a Bad Idea – it just depends who’s asking, what the question is, and where you’re looking for the answer”. Provenance also comes into play – how well do you know the data, and know of its likely bias or noise components (and can you account for that by weighting your analysis)? All of this could require a bit of a re-think on data management processes and procedures!
Which approach is right for you very much comes down to characteristics of your use cases and end-goal. A cleansed Big Data source is still essential for highly curated tasks where organisations are performing more formalised business reporting and analysis (for instance, around sales, finance, marketing, regulatory requirements etc.). Here the business audience needs to inherently trust the data, and can accommodate the hit on speed of delivery which that extra scrubbing will inevitably incur. On the other hand, a ‘data lake’ approach (and comfort with ‘unknown unknowns’) fits better where the analysis work is more exploratory and the goal is discovery of new insights (and speed is of the essence – i.e. for now, ‘quick and dirty’ will have to do).
Bear in mind too that these scenarios will likely nestle side-by-side in any company and so your choice of ‘clean data’ vs ‘more data’ approaches will need to complement each other. You’ll probably find yourself quantising the continuum of data cleanliness into buckets of data that can handle different amounts of dirt depending on what’s going to become of them.
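As a rough illustration of what ‘quantising the continuum’ might look like in practice, the sketch below sorts datasets into buckets according to a cleanliness score. Everything in it is hypothetical: the tier names, the thresholds, the two quality measures and the example datasets are assumptions chosen for the example, and a real scheme would be driven by your own use cases.

```python
from dataclasses import dataclass

# Hypothetical tiers, strictest first: (minimum score, bucket name).
TIERS = [
    (0.95, "curated"),      # formal reporting: finance, regulatory, etc.
    (0.70, "operational"),  # day-to-day analysis that tolerates some dirt
    (0.00, "exploratory"),  # data lake: quick and dirty discovery work
]


@dataclass
class Dataset:
    name: str
    completeness: float  # share of fields that are populated, 0.0 to 1.0
    validity: float      # share of records passing validation, 0.0 to 1.0


def cleanliness_score(ds: Dataset) -> float:
    """Score a dataset by its weakest quality measure (an assumed rule)."""
    return min(ds.completeness, ds.validity)


def assign_tier(ds: Dataset) -> str:
    """Return the strictest bucket whose threshold the dataset meets."""
    score = cleanliness_score(ds)
    for threshold, tier in TIERS:
        if score >= threshold:
            return tier
    return TIERS[-1][1]  # scores below zero fall back to the loosest bucket


if __name__ == "__main__":
    for ds in (
        Dataset("finance_ledger", 0.99, 0.97),
        Dataset("crm_contacts", 0.90, 0.85),
        Dataset("clickstream", 0.60, 0.80),
    ):
        print(f"{ds.name}: {assign_tier(ds)}")
```

Taking the minimum of the two measures is a deliberately cautious choice: a dataset only earns the ‘curated’ label if it is strong on every measure, which suits audiences who need to trust the numbers.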
Some of your Big Data use cases might resemble cats with obsessive grooming habits; others, more like oysters – content to craft pearls of wisdom out of whatever dirt they find in their shells. Perhaps I should have titled this blog “Cats vs Oysters”. It’s time to get to know what kind of an animal your data is up front… it could certainly save you some bother in the long run!
Published at DZone with permission of Angela Ashenden, DZone MVB. See the original article here.