
How to Collect Meaningful Data


Collecting data isn't hard. But collecting the right data is. The only way to collect the right data — meaningful data — is by defining the goals for what the data is to achieve.


The ability to gather meaningful data is as important as the insights the data can generate. Those insights, the end result of any data collection, are what people see and judge.

The hard truth here is that bad data leads to bad decisions. Thus, it is important to take the time necessary to build a proper data collection process. Two weeks ago, as I completed my big data certification, the importance of proper data collection became clear. It also reminded me of some basic data collection techniques I learned during Six Sigma training. That's what I want to share with you today.

There are many benefits to building a proper data collection process. The primary benefit will be to the teams that need to sift through the data for insights. The sooner they get value from the data, the better. This saves time and money for everyone involved.

Having a proper data collection process allows you to document what data is being collected, by whom, and for what purpose. Your data collection process should be part of a larger data governance strategy. Unfortunately, data governance is one of those things that happens after a company grows to a certain size. (So is data security, but I digress.)

Here's a simple process outline for you to review. It's worked well for me over the years. Feel free to adopt it or change it for your own needs. Use whatever you can to build a data collection process that helps you gather meaningful data.

Defining Goals

Collecting data isn't difficult. But collecting the right data is hard. And the only way to collect the right data, meaningful data, is by defining the goals for what the data is to achieve. You need to ask the basic questions: who, what, when, where, why, and how. You must have an idea of what problem the data needs to solve.

Data stratification is helpful here. Stratification is the grouping or sorting of data into sub-groups. You can use stratification to identify the requirements of each group that will consume the collected data. This will help you better understand the nature of the problem you are trying to solve. Each team will understand how they get data, and how that data relates to the larger data collection.
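As a rough illustration of stratification, here is a minimal Python sketch that groups a list of data requirements by the team that will consume them. The team names, metric names, and the `stratify` helper are all made up for the example; swap in whatever grouping key fits your situation.

```python
from collections import defaultdict

# Hypothetical requirements gathered from the teams that will consume the data.
requirements = [
    {"team": "operations", "metric": "batch_duration_seconds"},
    {"team": "operations", "metric": "rows_loaded"},
    {"team": "marketing", "metric": "page_views"},
    {"team": "finance", "metric": "order_total"},
]


def stratify(records, key):
    """Group records into sub-groups keyed by the value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


by_team = stratify(requirements, "team")
for team, needs in sorted(by_team.items()):
    print(team, [need["metric"] for need in needs])
```

Each sub-group is then a short list of what one team needs, which makes it easier to spot data nobody has asked for.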

It warrants mentioning that GDPR enforcement starts in three weeks (May 25, 2018). Be mindful of collecting personal data.

Defining Operational Procedures

Operational definitions and procedures define the measures for your data collection process. This is where you list who is collecting the data, why it is being collected, and the tools used. You should also specify whether methods such as sampling are being used. It is often difficult to collect an entire data set, and sampling may be a viable alternative. It's quite possible to learn a great deal from a small set of data.
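To show how far a small sample can go, here is a sketch that estimates an average from a simple random sample and compares it with the full data set. The load times are synthetic numbers generated for the example, and `sample_mean` is a hypothetical helper, not part of any particular tool.

```python
import random
import statistics


def sample_mean(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = rng.sample(values, sample_size)
    return statistics.mean(sample)


# Synthetic population: 10,000 made-up load times in milliseconds.
population_rng = random.Random(42)
load_times_ms = [population_rng.gauss(200, 25) for _ in range(10_000)]

full_mean = statistics.mean(load_times_ms)
estimate = sample_mean(load_times_ms, sample_size=500)
print(f"full mean: {full_mean:.1f} ms, sample estimate: {estimate:.1f} ms")
```

Here 5% of the rows should land close to the full average. Real data is rarely this well behaved, so record the sampling method and sample size in your operational definitions.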

Validating Metrics

Next, you define how to confirm the data collection process is running as expected. This is where you need to use some math. You could measure load times, or batch processing times, or sample your data to make sure that the set is valid.

The goal is to put a series of control reports in place. These reports should let anyone tell at a glance whether something is wrong with the data collection process. This could include a check for outliers or even garbage data.
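A control report can start out very small. The sketch below counts garbage values (missing or negative, which is my assumption for this example) and flags outliers with the common 1.5 × IQR rule. That rule is just one reasonable choice, and the batch times are invented.

```python
import statistics


def control_report(values, lower=0.0):
    """Summarize a batch of measurements and flag suspicious values.

    Missing values and values below `lower` count as garbage.
    Outliers are the remaining values outside the 1.5 * IQR fences.
    """
    valid = [v for v in values if v is not None and v >= lower]
    garbage = len(values) - len(valid)
    q1, _, q3 = statistics.quantiles(valid, n=4)  # quartile cut points
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in valid if v < low_fence or v > high_fence]
    return {"count": len(values), "garbage": garbage, "outliers": outliers}


# Hypothetical batch processing times in seconds, with two bad rows.
batch_seconds = [61, 58, 63, 60, 59, 62, 240, -1, None, 60]
report = control_report(batch_seconds)
print(report)
```

For this batch the report shows two garbage rows and flags the 240-second run as an outlier.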

Data Collection

You are now ready to start collecting data. This is where you want to think about how to present the data to end users for consumption.

Your data has a story to tell. If you can't, or won't, put your data into a visualization, then you aren't collecting the right data. It could also be that you don't know what problem you are trying to solve. In either case, go back to the start and redefine your goals.

Continuous Improvement

Once your collection is running you must review your control reports on a regular basis. You should build alerts so that you know immediately if something is wrong with the data collection process. The earlier you know of the issue, the better.
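An alert can be as simple as comparing the latest control report against thresholds you agreed on up front. In this sketch the report layout, the threshold values, and the `check_alerts` helper are all assumptions for illustration. In practice the messages would go to whatever paging or chat tool you already use.

```python
def check_alerts(report, max_garbage_ratio=0.05, max_outliers=0):
    """Return a list of alert messages for one control report.

    `report` is a dict with 'count', 'garbage', and 'outliers' keys.
    """
    alerts = []
    if report["count"] == 0:
        # Nothing arrived at all, which is a problem in its own right.
        alerts.append("no rows collected")
        return alerts
    garbage_ratio = report["garbage"] / report["count"]
    if garbage_ratio > max_garbage_ratio:
        alerts.append(
            f"garbage ratio {garbage_ratio:.0%} exceeds {max_garbage_ratio:.0%}"
        )
    if len(report["outliers"]) > max_outliers:
        alerts.append(f"{len(report['outliers'])} outlier(s) detected")
    return alerts


# Hypothetical summary from the most recent collection run.
latest = {"count": 10, "garbage": 2, "outliers": [240]}
for message in check_alerts(latest):
    print("ALERT:", message)
```

Keeping the thresholds as parameters makes it easy to tighten or loosen them as you learn what normal looks like.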

Also use this phase to determine whether the collection needs modification. Your end users may have new requirements, for example. Check in with them and make certain their needs are being met. If you need to make adjustments, update your defined goals and process accordingly.

Summary

I know, this sounds like I've taken something simple and made it more complex than necessary. After all, collecting and storing data isn't hard. We do it all the time.

But to me, the key difference here is that this process aids in the collection of meaningful data.

Look, it's not difficult to collect ALL THE DATUMS when you need to troubleshoot an issue. It's easy to build a process that collects every metric possible, in the hope that it might prove valuable later.

My point is this: You should collect the right metrics, not every possible metric.

Without a proper data collection process, you end up with way more noise than signal. This is a waste of bits on the wire. It could also lead to compliance and regulatory issues.

If something is worth doing, it's worth doing right.


Topics:
big data, data collection, data analytics

Published at DZone with permission of
