Taming the Data Variety Beast
Managing the variety of big data sources is becoming a greater challenge for enterprises, but help has arrived.
“Big data” has always focused on the three V’s: volume, velocity and variety. However, with much of the initial focus centering on the volume and velocity of data, enterprises have only been able to pay limited attention to managing its variety. With IDC predicting the worldwide IoT market will reach $1.7 trillion by 2020, it’s easy to see why getting a grip on data variety will be critical for organizations that hope to build a 360-degree view of their assets and gain important business insights.
New IoT technologies are allowing companies to easily bring data from their plants to industrial clouds like Predix. Similarly, due to improvements in security, data privacy and availability, organizations are becoming more confident in consolidating data from multiple sources into an industrial cloud for analysis instead of keeping this data in traditional, functional data centers. These shifts are leading to increased volume and variety of data streams that organizations have the opportunity to manage and mine for new insights.
In addition, as organizations get a better understanding of the value of their data and the insights that it unlocks, the sheer amount of data that needs to be captured and analyzed from such a wide variety of sources can become overwhelming. Most organizations don’t even know the scope of what they have (data sources, entities and attributes), let alone how to get them to work together at scale to power new findings and discover new business opportunities.
Aggregating the Data
Technology advances driven by the consumer internet experience are helping enormously in addressing the volume and velocity of big data by providing low-cost, highly resilient elastic storage techniques. In this environment, commoditized storage technologies can be used in very robust applications because the data is replicated to eliminate any single point of failure. In addition, technologies such as Hadoop lend themselves to massively parallel processing of this data, so analytics can run against big data stores in a fraction of the time it takes using traditional server-based techniques.
However, the issue of data variety remains much more difficult to solve. The problem is especially prevalent in large enterprises, which have many systems of record and also an abundance of both structured and unstructured data. These enterprises often have multiple purchasing, manufacturing, sales, finance, engineering, and other departmental functions in various subsidiaries and branch facilities. This functional separation means they often end up with siloed systems.
Companies have traditionally tried to use large data warehouse techniques to consolidate data from multiple sources. However, these techniques have been problematic. The set of questions to ask the data needs to be determined in advance in order to design the structure of the data warehouse. Many of these initiatives struggle to succeed, because the real value of big data lies in learning what questions to ask, based on hidden correlations and insights that can only be discovered once the data is analyzed. In addition, these data warehouses rarely make use of the modern elastic compute and storage techniques essential for analyzing data at scale.
Contextualizing the Data
Another challenge with data variety lies in putting the data into the right context. Nothing exists in isolation in today’s networked world, as most of the big data available for analysis is linked to outside entities and organizations. Often this context is provided in different formats, names or structures based on the way the original data source was managed, which can vary from source to source. Therefore, there are two main challenges to contextualizing data.
The first is surrounding the data with enough information to know where it came from (enterprise/plant/equipment/function/data entity); the second is semantically reconciling the data to a common language for use in the big data analysis.
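As a rough illustration of these two steps, the sketch below wraps each raw record in a provenance envelope and renames source-specific fields to a shared vocabulary. Everything in it is hypothetical: the `Provenance` fields simply mirror the enterprise/plant/equipment/function/data entity list above, and the `FIELD_SYNONYMS` mapping, company, plant and sensor names are invented for the example. In practice the mapping would come out of a curation process rather than being written by hand.

```python
from dataclasses import dataclass

# Hypothetical mapping from source-specific field names to a common vocabulary.
FIELD_SYNONYMS = {
    "temp_c": "temperature_celsius",
    "Temperature": "temperature_celsius",
    "vib": "vibration_mm_s",
    "VibrationLevel": "vibration_mm_s",
}


@dataclass(frozen=True)
class Provenance:
    """Where a record came from (field names are illustrative assumptions)."""
    enterprise: str
    plant: str
    equipment: str
    function: str
    entity: str


def contextualize(raw: dict, source: Provenance) -> dict:
    """Attach provenance and rename known fields to the common vocabulary.

    Field names with no known synonym are passed through unchanged.
    """
    reconciled = {FIELD_SYNONYMS.get(name, name): value for name, value in raw.items()}
    return {"source": source, "data": reconciled}


# Two plants report the same measurements under different field names.
plant_a = Provenance("Acme", "Plant A", "Pump 7", "maintenance", "sensor_reading")
plant_b = Provenance("Acme", "Plant B", "Pump 2", "maintenance", "sensor_reading")

record_a = contextualize({"temp_c": 71.5, "vib": 2.1}, plant_a)
record_b = contextualize({"Temperature": 68.0, "VibrationLevel": 1.8}, plant_b)

# After reconciliation both records expose the same field names,
# while still carrying their own origin.
print(sorted(record_a["data"]) == sorted(record_b["data"]))
print(record_a["source"].plant, record_b["source"].plant)
```

Keeping the provenance alongside the reconciled fields means an analysis can compare readings across plants without losing the ability to trace any value back to its origin.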
With so many data variables in play today, creating this broad system-wide view isn’t easy. Data curation is one way to attack the variety issue. It combines machine learning with advanced algorithms that seek high confidence levels and data quality, cross-referencing and connecting data from a variety of sources into a single condensed source. The end result is a continuous system of reference that can adapt to the variety of data being processed by large organizations.
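To make the idea of confidence-based cross-referencing concrete, here is a deliberately simplified sketch. It does not use machine learning; it substitutes plain string similarity from Python's standard `difflib` module for a trained matching model. The `CONFIDENCE_THRESHOLD` value, the record layout and the supplier names are all assumptions made up for the example.

```python
from difflib import SequenceMatcher

# Assumed cut-off: only matches at or above this similarity are merged.
CONFIDENCE_THRESHOLD = 0.85


def normalize(name: str) -> str:
    """Lower-case the name, drop commas and periods, collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a learned match score."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def curate(records: list) -> list:
    """Group records that refer to the same entity with high confidence.

    Each record is compared with the first member of every existing group.
    It joins the best-scoring group if the score clears the threshold,
    otherwise it starts a new group.
    """
    groups = []
    for record in records:
        best_group, best_score = None, 0.0
        for group in groups:
            score = similarity(record["name"], group[0]["name"])
            if score > best_score:
                best_group, best_score = group, score
        if best_group is not None and best_score >= CONFIDENCE_THRESHOLD:
            best_group.append(record)
        else:
            groups.append([record])
    return groups


# Hypothetical supplier records from three siloed systems.
records = [
    {"system": "purchasing", "name": "Acme Industrial Supply, Inc."},
    {"system": "finance", "name": "ACME Industrial Supply Inc"},
    {"system": "engineering", "name": "Globex Corporation"},
]

groups = curate(records)
for group in groups:
    print([r["system"] for r in group])
```

A real curation system would typically route low-confidence matches to a person for review instead of silently creating a new group, which is where the human effort discussed later in this article comes in.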
Analyzing the Data
Once the data is collected, aggregated, and semantically reconciled, and the relationships between the different data sets are understood, organizations can begin to get a 360-degree view of their assets, from design through maintenance and service, in order to better optimize their business.
The ability to get this complete view exposes managers and operators to the continuum of an organization’s workflow, from a piece of equipment or product to the entire plant or even a specific person, and allows them to ask questions they couldn’t ask before. For example, they can now predict whether a potential supplier change would increase the amount of service required for a given product.
Until technology provides a scalable and viable way to address the “high-variety” part of the big data challenge, organizations will have to rely on people to manage the process. This will keep the cost of big data initiatives high and limit their application in new environments. Ironically, those new environments are where the value of these initiatives may be most significant. The emergence of big data technology has driven many efficiencies in enterprises leveraging analytics, but it really takes all three V’s of big data (volume, velocity, and variety) to deliver on its true promise.