
Clean Data vs More Data

By Angela Ashenden
May. 13, 14 · Big Data Zone · Interview

Are you happy with your wash? Or are you deeply troubled by ‘dirty data’? Fear not… your real problem might simply be that you don’t have enough of it! A Big Data outfit might still be for you, as long as you’re careful what occasions you wear it for, and adhere to the washing instructions on the label.

Before I dove deep into Big Data for MWD, I was all set to follow my instincts on data cleanliness and subscribe to the traditional “Big Data with Bad Data is a Bad Idea” ethos… but it seems there’s a new paradigm on the block, and things aren’t as clear cut as you might have imagined.

The answer to bad data? Get more data! Well, as long as you’re completely clear on why you’re doing it and who you’re doing it for, anyway… best not to get carried away, eh? The idea with the ‘more data’ approach is that, rather than expend significant resources on cleaning and consolidating data prior to analysis, you should instead simply hoard more of it (i.e. the ‘data lake’ approach). The premise here is that a ‘good enough’ signal will eventually rise up from the noise (or, where the data is ‘bad’ because there is ‘no data’… from the silence – but that’s another story!).
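To make the ‘signal rising from the noise’ premise a little more concrete, here is a minimal sketch, not anything taken from a real project. It assumes the dirt is purely random, zero-mean noise around a true value; the function name `noisy_mean` and the numbers used are invented for illustration.

```python
import random
import statistics


def noisy_mean(true_value, noise_sd, n, seed=0):
    """Average n readings of true_value, each corrupted by zero-mean Gaussian noise.

    Assumption: the noise is random and unbiased. This is an illustration,
    not a model of any particular data source.
    """
    rng = random.Random(seed)
    readings = [true_value + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)


if __name__ == "__main__":
    # The absolute error of the estimate tends to shrink as n grows.
    for n in (10, 1_000, 100_000):
        estimate = noisy_mean(true_value=50.0, noise_sd=5.0, n=n)
        print(n, abs(estimate - 50.0))
```

The catch is in the assumption: random noise averages away as the pile grows, but a systematic bias does not, however much more data you hoard.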

Beware, though, that a ‘yet more dirty data’ approach may stick in the craw of DBAs, data stewards, data architects, etc. with views similar to my own original default position. What we’re saying now is “well, it might be a Bad Idea – it just depends who’s asking, what the question is, and where you’re looking for the answer”. Provenance also comes into play – how well do you know the data, and know of its likely bias or noise components (and can you account for that by weighting your analysis)? All of this could require a bit of a re-think on data management processes and procedures!
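As a rough illustration of what ‘weighting your analysis’ could look like, the sketch below combines readings from several sources and gives less say to the ones you know to be noisier (inverse-variance weighting). This is one common technique, not necessarily the one any given team would pick; the function name `weighted_estimate` and the idea that you already have a noise estimate per source are assumptions.

```python
def weighted_estimate(observations):
    """Combine (value, noise_sd) pairs, weighting each by 1 / noise_sd**2.

    Assumption: each source's noise level (noise_sd > 0) is already known
    from its provenance. Noisier sources contribute less to the result.
    """
    weights = [1.0 / (sd ** 2) for _, sd in observations]
    total = sum(weights)
    return sum(w * value for w, (value, _) in zip(weights, observations)) / total


if __name__ == "__main__":
    # A trusted source (sd 1.0) and a noisy one (sd 3.0): the result
    # sits much closer to the trusted reading.
    print(weighted_estimate([(10.0, 1.0), (20.0, 3.0)]))
```

Note that this only handles noise. A known bias would need to be corrected for separately (for example, subtracted out) before the weighting is applied.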

Which approach is right for you very much comes down to the characteristics of your use cases and your end goal. A cleansed Big Data source is still essential for highly curated tasks where organisations are performing more formalised business reporting and analysis (for instance, around sales, finance, marketing, regulatory requirements, etc.). Here the business audience needs to inherently trust the data, and can accommodate the hit on speed of delivery which that extra scrubbing will inevitably incur. On the other hand, a ‘data lake’ approach (and comfort with ‘unknown unknowns’) fits better where the analysis work is more exploratory and the goal is discovery of new insights (and speed is of the essence – i.e. for now, ‘quick and dirty’ will have to do).

Bear in mind too that these scenarios will likely nestle side-by-side in any company and so your choice of ‘clean data’ vs ‘more data’ approaches will need to complement each other. You’ll probably find yourself quantising the continuum of data cleanliness into buckets of data that can handle different amounts of dirt depending on what’s going to become of them.
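One way to picture those buckets is sketched below. It is purely hypothetical: the completeness score, the threshold values and the bucket names (`curated`, `exploratory`, `raw`) are all invented here, and a real scheme would depend on what each downstream use can tolerate.

```python
def completeness(record, required_fields):
    """Fraction of required fields that are present and non-empty in a record."""
    filled = sum(1 for field in required_fields if record.get(field) not in (None, ""))
    return filled / len(required_fields)


def assign_bucket(score, curated_min=0.95, exploratory_min=0.5):
    """Map a cleanliness score in [0, 1] to a bucket (thresholds are assumptions)."""
    if score >= curated_min:
        return "curated"      # clean enough for formal reporting
    if score >= exploratory_min:
        return "exploratory"  # good enough for discovery work
    return "raw"              # keep in the lake, but handle with care


if __name__ == "__main__":
    record = {"id": 1, "amount": 9.5, "region": ""}
    score = completeness(record, ["id", "amount", "region"])
    print(assign_bucket(score))
```

Completeness is only one crude proxy for cleanliness; accuracy, consistency and freshness could feed the same kind of score.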

Some of your Big Data use cases might resemble cats with obsessive grooming habits; others, more like oysters – content to craft pearls of wisdom out of whatever dirt they find in their shells. Perhaps I should have titled this blog “Cats vs Oysters”. It’s time to get to know what kind of an animal your data is up front… it could certainly save you some bother in the long run!


Published at DZone with permission of Angela Ashenden, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
