Do You Know Your Data?
The difference between storing data and having insight into your data can be the difference between your business succeeding and failing in a competitive market.
Most businesses understand the need to store data, and most are now doing so to one degree or another. Many enterprises, however, still don’t fully understand how to leverage the data they are storing to extract actionable business value, nor do they have the automated processes in place to accomplish this.
Just storing data is not enough. Without an up-to-date picture of the critical business information that the data contains and conveys, your enterprise is missing out on insights that can have significant impacts on revenue, profit, risk avoidance, and issue mitigation and remedy, as well as on surfacing appropriate and desirable content, producing optimal search results, and discovering and realizing enterprise process cost savings.
Not Knowing Your Enterprise's Data Puts Your Business at Serious Competitive Risk
It is not enough to just put the required data ingestion pipelines in place and store the data that you are collecting. To get the data to actually work for you, to gain timely insights, and to capture the actionable value contained within it, the data needs to be analyzed and rolled up into clear, useful signals. Only then can it realistically be reported on, interpreted, and consumed by processes or by enterprise-scale decisional and monitoring systems.
If your data is just gathering dust in some database or other repository, it is doing nothing of actual value for you, and your business is not benefiting from the insights and signals it contains. If you don’t develop and use insights from the data you have, competitors who do leverage their data may begin outperforming and outcompeting your business.
For your data to be actionable, it first needs to be transformed into consumable, quantifiable artifacts, often by being factored into things like scores and classifiers that represent some set of underlying, evolving data. Once this is achieved, higher-order software that observes these generated quantifiers, and any change events they produce, gains key metrics that provide a dynamic window into the actual significance of some measured aspect. These dynamically updated classifiers, which domain-level and enterprise-level software systems watch and ingest into their algorithms, represent and quantify the value extracted from a much larger mass of data.
For example, many online businesses, including even some large websites that collect vast amounts of web usage data, are not successfully leveraging this usage and user information to help inform and drive their real-time enterprise. While they may periodically run ad-hoc analytics queries on the data and make sporadic attempts to quantify and characterize it, they are not systematically processing the incoming data streams they are collecting to create and then maintain an up-to-date picture of current usage patterns, potential issues, and problems, as well as their individual users’ preferences and tastes.
The Data Story Is Dynamic, and So the Data Pipelines Must Be Dynamic as Well
In their day-to-day business activities, many companies are basically running blind and hoping for the best. This is true even when they have some coarse-grained analytics snapshot, which, though better than no insight at all, may not reflect the actual state of a dynamic system.
Instead, by putting fully automated data pipelines in place, they could be making better decisions, informed by quantified attributes and flags that are continuously updated by actual real-time activity.
The real picture of any enterprise is dynamic, always evolving, and responding to external and internal changes. While producing static snapshots is better than doing nothing, without putting in a fully automated self-updating system for generating and updating classifiers and signals, the enterprise cannot fully benefit from the streams of potential value that are occurring in real time.
Reality is dynamic… and the corresponding data ingestion, processing, and reporting/signaling pipelines must be as well.
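To make the contrast with a static snapshot concrete, here is a minimal sketch of a flag that is updated by every incoming event rather than recomputed occasionally. The class name `RollingErrorRate`, the smoothing factor, and the alert threshold are all hypothetical choices made for illustration; an exponentially weighted average is only one of many ways such a signal could be maintained.

```python
class RollingErrorRate:
    """Exponentially weighted error rate, updated once per observed event.

    All names and default values here are illustrative assumptions.
    """

    def __init__(self, alpha=0.5, alert_threshold=0.5):
        self.alpha = alpha                      # weight given to the newest event
        self.alert_threshold = alert_threshold  # level at which the flag is raised
        self.value = 0.0                        # current rolled-up signal

    def observe(self, is_error):
        """Fold one new event into the signal and return the updated value."""
        sample = 1.0 if is_error else 0.0
        self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value

    @property
    def alert(self):
        """Flag that downstream monitoring could watch."""
        return self.value >= self.alert_threshold


signal = RollingErrorRate()
history = []
for outcome in [False, False, True, True]:  # hypothetical stream of request outcomes
    signal.observe(outcome)
    history.append((signal.value, signal.alert))

for value, alert in history:
    print(f"rate={value:.2f} alert={alert}")
```

The point of the sketch is that the flag changes as soon as the activity changes, whereas a snapshot taken after the first two events would still report a healthy system.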
We Live in an Age of Big Data
As the unit costs of digital data storage have plummeted, and continue to decline, the amount of data being stored has correspondingly exploded. Enterprises, by and large, understand that they need to store the data they create in logs, file servers, databases, etc. They are doing this at record rates. More and more data is being ingested, including huge volumes of telemetry data produced by IoT devices.
This is, of course, good practice. The exception is when, often for valid reasons such as privacy concerns, some portions of a given body of data or of an ingested stream should be hidden. Even in such cases, the data is still being stored somewhere; it is simply not exposed outside a given system or security context.
Enterprises are, by and large, doing a pretty good job of collecting and storing data. In fact, so much data is getting stored that a new software discipline, termed “Big Data,” has grown up around the specialized techniques required to work effectively with massive quantities of records.
But is getting the data into the database (or another type of structured repository, such as one of the NoSQL solutions) enough? Collecting the data is certainly better than nothing, and having it in a structured form makes it queryable and discoverable.
Having Data Is Not the Same as Knowing Your Data
The question decision-makers need to be asking themselves is: To what extent is ingested enterprise data being discovered, surfaced, and made visible in an actionable form?
Many data practitioners will recognize and are all too familiar with this problem of having reams of data that, in theory, are available, but that, in effect, remain invisible to the enterprise. Huge tables, entire columns or rows of data, etc., can remain, for the most part, unexplored, existing in a state of limbo or a kind of terra incognita. In theory, especially if it is structured data, it can be queried and quantified, but, all too often, no query has been run or is being run to do this, and thus this data, though it has been recorded, and is available, contributes nothing to the actionable insights and success of the enterprise.
Decision makers should ask themselves if they can afford to ignore their data, to collect it, but then leave it in this limbo state of being unreported on. They need to begin wondering if they are missing out on critical signals or insights because they actually only know a relatively thin slice of all the data they have.
Knowing Your Data
While each body of data can present different challenges, a few basic approaches seem generally applicable. I am outlining some that I feel are widely shared. The list is by no means complete or exhaustive, and others may have different views about what belongs in it… and that is fine. In fact, I’d love to hear them!
Explore the Data You Have
First off, explore the data. Go in and take a look at random rows and columns. If it is a relational data set or, for example, a collection of JSON documents representing records of some sort, explore any schemas, indexes, etc. Get a picture of the existing data and its structure.
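As one possible starting point, the sketch below lists the tables in a SQLite database, reads each table's columns, counts its rows, and pulls a few random rows to look at. SQLite is used only because it ships with Python; the `explore` helper and the `page_views` demo table are hypothetical, and other databases expose their schema through different catalog queries.

```python
import sqlite3


def explore(conn, sample_size=3):
    """Return {table: {"columns": [...], "row_count": n, "sample": [...]}}."""
    summary = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [(col[1], col[2])
                   for col in conn.execute(f'PRAGMA table_info("{table}")')]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        # ORDER BY RANDOM() is fine for a quick look, but slow on very large tables.
        sample = conn.execute(
            f'SELECT * FROM "{table}" ORDER BY RANDOM() LIMIT ?',
            (sample_size,)).fetchall()
        summary[table] = {"columns": columns, "row_count": row_count, "sample": sample}
    return summary


# Hypothetical demo data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, url TEXT, seconds_on_page REAL)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [(1, "/home", 12.5), (2, "/pricing", 40.0), (1, "/docs", 95.2), (3, "/home", 3.1)])
conn.commit()

summary = explore(conn)
for table, info in summary.items():
    print(table, info["columns"], "rows:", info["row_count"])
    for row in info["sample"]:
        print("  ", row)
```

Because the sample is random, the rows printed will differ from run to run, which is the intent: repeated looks tend to surface columns and values that a fixed query would miss.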
Quantify and Classify
Once you have formed a good picture of the data you wish to quantify, and have decided which columns, tables, etc. you want to track, you can begin figuring out how to ingest it for the purpose of classification, for example, by running a query.
Often you will be trying to classify something or, conversely, to adjust some classifier as a result of new data. Many techniques and algorithms exist, and each problem has its own specific requirements that must be met. However, the underlying activity is essentially similar.
The process of quantifying, or classifying data, usually involves the following sequence of steps:
- Raw input data exists, or is continuously being ingested or produced (an event stream or usage log, for example). This data is also associated with some identifying criteria, such that it is granularly addressable. In other words, just saying it is in the database won’t cut it. You need to be able to address what is getting measured, down to the atomic level. It is not enough to know approximately where to find it. If you cannot uniquely address your data items, you cannot reliably find them, and they are therefore of little value.
- This data (or stream) is then processed by one or more analyzers, which extract useful and actionable information from it. For example, an analyzer may classify some item or adjust some composite sentiment.
- The newly generated, or more typically updated computed artifact, which could be a score or classification, for example, is itself stored (updated) in a database or other repository system.
- The generated quantifier or classifier entity is then wired up to and exposed to interested decisional, monitoring, and reporting systems.
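The four steps above can be sketched in a few lines. This is only an illustration under stated assumptions: `ScoreStore`, `EngagementAnalyzer`, and `run_pipeline` are hypothetical names, an in-memory dictionary stands in for the database, and a running average of time on page stands in for whatever analysis a real system would perform.

```python
from collections import defaultdict


class ScoreStore:
    """Stand-in for the repository that holds computed artifacts (step 3)
    and exposes them to interested systems (step 4)."""

    def __init__(self):
        self._scores = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Register a decisional or monitoring system to be told about updates."""
        self._subscribers.append(callback)

    def update(self, key, score):
        self._scores[key] = score
        for callback in self._subscribers:
            callback(key, score)

    def get(self, key):
        return self._scores.get(key)


class EngagementAnalyzer:
    """Step 2: turns raw events into a running average per user."""

    def __init__(self):
        self._totals = defaultdict(float)
        self._counts = defaultdict(int)

    def process(self, event):
        key = event["user_id"]  # the identifying criterion that makes the data addressable
        self._totals[key] += event["seconds_on_page"]
        self._counts[key] += 1
        return key, self._totals[key] / self._counts[key]


def run_pipeline(events, analyzer, store):
    """Feed each raw event (step 1) through the analyzer and publish the result."""
    for event in events:
        key, score = analyzer.process(event)
        store.update(key, score)


# Hypothetical event stream.
events = [
    {"user_id": "u1", "seconds_on_page": 10.0},
    {"user_id": "u2", "seconds_on_page": 50.0},
    {"user_id": "u1", "seconds_on_page": 30.0},
]

store = ScoreStore()
notifications = []
store.subscribe(lambda key, score: notifications.append((key, score)))
run_pipeline(events, EngagementAnalyzer(), store)

print(store.get("u1"), store.get("u2"))
```

Keeping the analyzer, the store, and the subscribers separate means any one of them can be replaced, for example swapping the dictionary for a real database, without touching the others.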
Deciding What Your Data Means
It is these higher-level recommenders, and other types of decisional support systems, such as monitoring or reporting and alarming systems, that are interpreting the data and are responding to relevant signals, and events contained within it.
While decisional software and similar systems often do ingest raw data, the performance of these analytics engines can be much improved by pre-processing the raw data into quantified, rolled-up representations of the current state of a set or group of data items.
For example, a consumer sentiment score about some tracked dimension of particular interest to an organization might be composited from various columns, tables, and other sources, and could represent a fairly large set of granular pieces of data that some algorithm has rolled up into the published sentiment classifier. The classifier itself may be as simple as a floating-point score or an enumerated bucket label.
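A roll-up of this kind might look like the following sketch, which combines several signals into one floating-point score and then into a bucket label. The signal names, weights, and bucket boundaries are invented for illustration, and a weighted average is just one simple way to composite them.

```python
def composite_sentiment(signals, weights):
    """Weighted average of per-source signals, each assumed to lie in [-1, 1]."""
    total_weight = sum(weights[name] for name in signals)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(signals[name] * weights[name] for name in signals) / total_weight


def bucket(score):
    """Map the floating-point score onto an enumerated label (boundaries are assumptions)."""
    if score <= -0.25:
        return "negative"
    if score < 0.25:
        return "neutral"
    return "positive"


# Hypothetical inputs, each already normalized to [-1, 1] by some upstream step.
signals = {"review_rating": 0.5, "support_tickets": -0.5, "repeat_purchases": 1.0}
weights = {"review_rating": 2.0, "support_tickets": 1.0, "repeat_purchases": 1.0}

score = composite_sentiment(signals, weights)
label = bucket(score)
print(score, label)
```

A downstream system can then watch the single score or label instead of re-reading every underlying table.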
Keep It All Working
After you have accomplished the following, you need to keep it all running:
- Explored the data and decided what you want to report on.
- Developed the queries and software required to roll up the jumble of raw data into a corresponding set of representative, quantified values.
- Wired up your analytics engines and your monitoring and alarming packages to ingest these generated artifacts (scores, classifiers, etc.).
This is not a one-off, nor is it a trivial task. Data is always changing and evolving, so the systems that consume it need to respond and evolve accordingly. Most enterprises use multiple packages, often from various vendors, to try to accomplish this goal. I strongly recommend writing a full suite of unit, integration, and regression tests to exercise the possibly large number of processes and services involved in an enterprise-scale system, so that it can be evaluated and monitored. A deep, robust suite of tests also helps manage and smooth the process of continuous evolution and change.
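As a small illustration of the kind of test meant here, the sketch below pins the expected output of a roll-up function for a handful of inputs, so that a later change to the thresholds cannot slip through unnoticed. The function `churn_risk_label` and its boundaries are hypothetical; the test uses Python's standard `unittest` module.

```python
import unittest


def churn_risk_label(days_since_last_login):
    """Hypothetical roll-up under test: maps inactivity onto a risk label."""
    if days_since_last_login < 0:
        raise ValueError("days_since_last_login must be non-negative")
    if days_since_last_login <= 7:
        return "low"
    if days_since_last_login <= 30:
        return "medium"
    return "high"


class ChurnRiskLabelTest(unittest.TestCase):
    def test_pinned_cases(self):
        # Boundary values are pinned so an accidental threshold change fails loudly.
        cases = [(0, "low"), (7, "low"), (8, "medium"), (30, "medium"), (31, "high")]
        for days, expected in cases:
            with self.subTest(days=days):
                self.assertEqual(churn_risk_label(days), expected)

    def test_rejects_negative_input(self):
        with self.assertRaises(ValueError):
            churn_risk_label(-1)


# Run the suite explicitly so the sketch also works when pasted into a larger script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ChurnRiskLabelTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Integration tests would follow the same idea at a larger scale, feeding known raw data through the whole pipeline and checking the published scores.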
Knowing Your Data, Classifying and Quantifying It
It is also true that different organizations are in very different places in terms of what they are already doing and how mature their current systems are. There really is no end to this process. There is always going to be more work that needs to be done to improve existing systems (or the robustness and resiliency of these critical services). As with many processes, it is a journey, and there will always be more areas to explore. Journey though it may be, it is useful to ask yourself where you are right now. What is the current picture? And… do you even know what the current picture is?
Each person or organization, and even different people within the same organization, will face different challenges and be in a different context. Each will also have a different view into the organization and its data. In other words, the picture is often quite complex, fragmented, and multi-dimensional, and it will vary depending on a given perspective.
But, in general, it’s never a bad thing to ask oneself, how well do I know my data?
Decision makers are increasingly realizing that significant opportunities exist for additional revenue, higher profit margins, increased customer satisfaction, and agile problem avoidance and mitigation. These opportunities come from putting in place continuous, long-running processes that keep the evolving, business-consumable picture of enterprise data up to date, update key quantified metrics, and make these signals rapidly available to, and seamlessly integrated into, enterprise decisional systems.
Really knowing your data is the key to profiting from it.
Opinions expressed by DZone contributors are their own.