Q and A about Dark Data
Having structured data management processes helps organizations to better use their data for BI and analytics.
I had the opportunity to put a few questions to Shahin Pirooz, CTO of DataEndure, about how companies collect and use dark data today.
How do you define “dark data?”
Dark data, as defined by Gartner, is “the information assets that organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).” The name alludes to dark matter, the unknown, unseeable material in space, and highlights how little companies use or think about their data. Most companies hold on to all the data they collect, regardless of its usefulness. There are 40,000 search queries per second on Google, and a billion daily Facebook users. Data is being uploaded, shared, processed, and stored, but only sometimes accessed. How does a company deal with all this data, especially the dark data?
Sometimes dark data holds highly valuable information that organizations aren’t taking advantage of. By ignoring their troves of dark data, companies are leaving invaluable information about product consumption and potential revenue streams on the table.
Is “dark data” a good thing or a bad thing for companies to have? How much do companies spend storing “dark data?”
Dark data is a great thing, but only if it is consumed, categorized, processed, and analyzed for important business metrics. Conversely, massive amounts of unorganized and unanalyzed dark data can slow business operations. Many organizations report feeling they have “inadequate reporting capabilities” and “disorganized content management.”
Storing poor, unorganized data costs businesses up to 35% of their operating costs, and costs U.S. businesses alone $600 billion per year. Companies can save millions of dollars by migrating unaccessed or trivial data to Tier 2 storage or to the cloud.
How is “dark data” being used to improve a business?
Dark data offers two direct advantages if it is identified, classified, and consumed appropriately. The first is the business advantage: the large volume of business metrics that can be pulled out of dark data can improve decision-making. The second is the significant cost saving a company gains by simply moving unused dark data to second-tier storage. This frees up critical, expensive storage for primary workloads and moves unused data into cheaper secondary storage or cloud storage.
How is “dark data” being used to improve the customer experience?
Dark data is found in customer call detail records, server log files, mobile data, survey data, invoices, purchase orders, and email. Email is one of the largest sources of dark data. Properly classifying and leveraging this data provides great insight into the voice of the customer, customer consumption, and product strategies.
What best practices do you and your company recommend for managing “dark data?”
With this exponential growth in storage, organizations are simply relying on employees and IT to manage the increasing volumes of data. Incorrect data management and a lack of automated processes can have a huge impact on organizations: in most cases, their Tier 1 storage costs and data center footprint are much higher than they should be. The best practice, as a starting point, is to identify and classify the dark data and dynamically move unused data to secondary storage. Beyond this first step comes further analysis of the data as a whole, to ensure you are not leaving critical business insights on the table.
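As an illustration of that first step, here is a minimal Python sketch of finding unused files and moving them to secondary storage. It is not DataEndure's method or any product's API. The function names `find_cold_files` and `move_to_secondary`, the idle threshold, and the idea that secondary storage is simply another directory are all assumptions made for the example.

```python
import os
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def find_cold_files(root, max_idle_days, now=None):
    """Return files under root not read or modified in max_idle_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * SECONDS_PER_DAY
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            info = path.stat()
            # Access times can be stale on volumes mounted with noatime or
            # relatime, so use the later of access and modification time.
            last_used = max(info.st_atime, info.st_mtime)
            if last_used < cutoff:
                cold.append(path)
    return sorted(cold)


def move_to_secondary(paths, root, secondary_root):
    """Move files to secondary_root, keeping their path relative to root."""
    moved = []
    for path in paths:
        target = Path(secondary_root) / Path(path).relative_to(root)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

A scheduled job could call `find_cold_files("/data/primary", 180)` and pass the result to `move_to_secondary`; both the path and the 180-day threshold are placeholders. A real deployment would also need to handle permissions, open files, and some way for applications to find the data after it moves.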
What’s the future for “dark data?”
The future will hopefully involve improved data management. If you look at where we are now, more data has been created in the past two years than in all of prior human history. On Google alone, there are 40,000 search queries every second. That means dark data will only continue to accumulate. In the future, data will move dynamically to either primary storage for performance or secondary or tertiary storage for archiving. The future will also include a Big Data play, where this dark data will have a light shone on it and important business insights will be extracted to provide a competitive advantage.
How do you recommend developers handle “dark data” when developing applications?
I recommend developers consider the implications of dark data. When data gets lost, opportunities are missed and work often has to be redone. It’s important to build a common taxonomy of data types, starting with the applications, and to work with data optimization and classification products to better classify data from cradle to grave.
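One way a developer might start on a shared taxonomy is sketched below in Python. The category names loosely follow the data sources mentioned earlier in this interview; the matching rules (file suffixes and filename keywords) and all identifiers such as `DataCategory` and `classify` are hypothetical and much cruder than what a dedicated classification product would do.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import PurePath


class DataCategory(Enum):
    SERVER_LOG = "server_log"
    EMAIL = "email"
    INVOICE = "invoice"
    SURVEY = "survey"
    UNCLASSIFIED = "unclassified"


# Hypothetical rules: a real taxonomy would be agreed on across teams.
SUFFIX_RULES = {
    ".log": DataCategory.SERVER_LOG,
    ".eml": DataCategory.EMAIL,
    ".msg": DataCategory.EMAIL,
}
KEYWORD_RULES = (
    ("invoice", DataCategory.INVOICE),
    ("survey", DataCategory.SURVEY),
)


@dataclass(frozen=True)
class ClassifiedItem:
    path: str
    category: DataCategory


def classify(path):
    """Tag a path with a category: suffix rules first, then name keywords."""
    pure = PurePath(path)
    category = SUFFIX_RULES.get(pure.suffix.lower())
    if category is None:
        name = pure.name.lower()
        category = next(
            (cat for keyword, cat in KEYWORD_RULES if keyword in name),
            DataCategory.UNCLASSIFIED,
        )
    return ClassifiedItem(str(path), category)
```

If every application tags data with the same enum when it writes it, downstream tools can group and tier it without guessing. Anything that lands in `UNCLASSIFIED` is a candidate for a new rule or for manual review.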
What else is important for developers and engineers to know about “dark data?”
It’s important to know that accumulated excess data slows down IT infrastructure. Many of the data security issues that get reported stem from the accumulation of excess, unorganized data. Having structured data management processes also helps organizations make better use of their data for BI and analytics.