The Pure Beauty of Dirty Data
Talk to any business serious about implementing big data analytics solutions, and they’ll inevitably talk about the need to gather as much data as possible, then clean it. Cleaning data is usually associated with attaining the highest quality data possible. It’s reasonable to assume that the cleaner the data, the better the analytics results will be. With this in mind, businesses are spending large sums of money ensuring that the data they use meets a high standard.
But what if that money could be better spent elsewhere? And what if the traditional thinking regarding so-called dirty data was wrong? The process of cleaning data can be time-consuming and quite expensive, and yet companies are still emphasizing it as a pivotal step in analytics usage. Doing so may place too much focus on clean data while ignoring the possibility that dirty data has its own beauty that can add value to an organization.
Admittedly, this is a minority view on the subject of dirty data. Whether a company is a retail outlet, a healthcare provider, one of many converged infrastructure vendors, or a team of researchers, most will believe dirty data should be cleaned at every opportunity, even if doing so is labor intensive. That requires getting rid of redundancies, outliers, errors, and other mishaps to give organizations a nice clean set of data to work with, giving them a better chance of gaining hidden insights. But cleaner data might not always lead to better results. As John Showalter, MD, the chief health information officer at the University of Mississippi Medical Center, explains, dirty data can still be trusted to deliver accurate results. In fact, cleaning may not do much to improve analytics at all. Taking a low-effort approach may be more cost-effective than spending the time cleaning data sets.
That’s not to say data teams shouldn’t be concerned about the accuracy of the data they end up using, but rather that dirty data can still be accurate enough for organizations’ purposes. The emphasis should be less on the cleaning process and more on actually acting on the insights data scientists are able to uncover. Data, after all, is only information; what you do with it is far more important. For obvious reasons, large errors should not be allowed to creep into data sets, and basic cleaning principles should be followed, but going to the painstaking effort of making data sets pristine may not give businesses the return they’re hoping for.
Indeed, the idea of making a data set perfect may be a mistake in itself, since there’s really no way to achieve the goal of perfect data. In fact, dirty data may actually be used as an excuse by some businesses to avoid having to act on the big data they do manage to collect. To be clear, mistakes and errors will always be part of big data. Some of those mistakes may be small, but trying to clean all of them out of a data set is an impossible task. Yet organizations still attempt to do so, depriving themselves of the opportunity to use data for business objectives. Many of the errors are truly harmless and won’t do anything to damage analytics efforts. On the occasions where errors can negatively affect analytics, they can usually be dealt with as needed.
Some experts even hold to the idea that removing apparent errors in data sets may block organizations from uncovering insights needed to take them to a different level. Outliers, for example, may not be mistakes but rather early indications of rising trends. Ignore those and you would be missing out on valuable information that could direct business decisions for the foreseeable future. Dirty data, in other words, may actually hold the insights organizations wouldn’t normally get were they to only focus on cleaning their data.
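To make the point about outliers concrete, here is a minimal Python sketch of flagging unusual values for review instead of deleting them. It is only an illustration: the `flag_outliers` helper, the `weekly_orders` figures, and the 3.5 cutoff on the median-based modified z-score are all assumptions chosen for the example, not anything taken from a particular analytics tool.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Pair each value with True/False: is it an outlier by modified z-score?

    Hypothetical helper. Nothing is removed; values are only labeled so that
    someone can decide later whether they are errors or early signals.
    """
    med = median(values)
    # Median absolute deviation (MAD): a spread measure that extreme
    # values do not distort much.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Simplification: with no measurable spread, flag nothing.
        return [(v, False) for v in values]
    return [(v, abs(0.6745 * (v - med) / mad) > threshold) for v in values]

# Made-up weekly order counts; the last two weeks jump sharply.
weekly_orders = [102, 98, 105, 97, 101, 99, 180, 240]

flagged = flag_outliers(weekly_orders)
outliers = [v for v, is_outlier in flagged if is_outlier]
print(outliers)  # [180, 240]
```

The two flagged weeks stay in the data set. Whether they are entry mistakes or the start of a rising trend is a question for a person who knows the business, and that question can only be asked if the values have not already been scrubbed away.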
The idea behind clean data isn’t necessarily misguided, but dirty data has plenty of value that shouldn’t be overlooked. Ultimately, big data analytics is already equipped to deal with data that has incomplete sets, repeated values, and mistakes. Time and money spent trying to clean all of that up could be put to better use on other projects.
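As a rough illustration of how an analysis can tolerate gaps, repeats, and mistakes, the toy Python sketch below compares a mean and a median on a small data set before and after it gets dirty. The numbers and variable names are invented for the example, and it says nothing about how any specific analytics product handles such data.

```python
from statistics import mean, median

# Made-up "clean" readings.
clean = [20, 22, 21, 23, 24]

# The same readings with a missing entry, a value repeated twice,
# and one obvious typo (2300 instead of 23).
dirty = [20, 22, None, 21, 23, 24, 22, 22, 2300]

# The only cleaning step applied: skip entries that are missing.
observed = [v for v in dirty if v is not None]

clean_median = median(clean)
dirty_median = median(observed)
clean_mean = mean(clean)
dirty_mean = mean(observed)

print(clean_median, dirty_median)  # 22 22.0
print(clean_mean, dirty_mean)      # 22 306.75
```

The median barely notices the mess, while the mean is thrown off by the single typo. Choosing a summary that is robust to errors can sometimes be cheaper than hunting down every bad record, though errors that do distort the result still need to be dealt with as they surface.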