The Skinny on Big Data: Everything You Need to Know From Our CTO
If you're here to get an introduction to the field of big data, read on for great insights into the Four Vs, data cleaning, and more.
The term "big data" comes up frequently in articles and office discussions, but what does it really mean to utilize big data?
Big data: it’s one of those buzzwords you can’t seem to get away from. Though you might have an idea of what it means, there’s likely plenty you don’t know about the ins and outs of what it takes to really harness the power of big data.
We sat down with Robert Swisher, Chief Technology Officer at Business.com, to get a better idea of what the big deal is with big data. He breaks down what it all means and how to make it work for you.
What Exactly Is Big Data?
RS: It’s basically a large set of data. People use different terms, but it’s just a huge amount of data, structured and unstructured, that’s coming in at a high velocity and in a big volume. A lot of the time it’s not that “clean,” so you have to manipulate, sanitize, and convert that data to clean it up and make it usable.
At its core, though, it’s just a gigantic set of data.
What Could That Data Be, Exactly?
RS: So, for example, it could be all the point of sale data for Best Buy. That’s a huge data set: everything that goes through a cash register. For us, it’s all of the activity on a website, so a ton of people coming through, doing a bunch of different things. It’s not exactly cohesive or structured.
With point of sale, for example, you’re looking at what people are purchasing and what they’ve done historically. You’re looking at what they’ve clicked on in email newsletters, loyalty program data, and coupons that you’ve sent them in direct mail — have those been redeemed? All these things come together to form a data set around purchasing behavior. You can look at what “like” customers do in order to predict what similar customers will buy as well.
What’s Your Take on Why Big Data Is Such a Big Trend Now as Opposed to Years Prior?
RS: I think that the technology needed time to evolve. The core technology that’s used for big data was developed about ten years ago. There’s the software component that allows you to manage these data sets, and there’s the hardware component: storage and compute have been getting cheaper, which makes big data more accessible for businesses. They can now make use of their large data sets with off-the-shelf, open-source technology.
What Are the Most Commonly Held Misconceptions About Big Data?
RS: In my opinion, people think it’s this magical thing. They think, “We’ll just turn that on and now things will just work and we’ll know all this stuff.” But it’s just not that simple. It’s actually really complicated, and you need the right equipment and people who understand how to analyze and work with big data.
Increasingly, simplified tools are coming out for non-technical users to create dashboards and get some of the information they’re looking for, but it is a really specialized skillset. It’s not something you can just turn on and have. There’s an investment in people, time, and hard costs to make this stuff work.
Would You Say That the First Step Would Be to Determine What Exactly You’re Measuring?
RS: That would be one way to do it. The other way to go about it would be to make a list of what types of data you have that you’re not making use of. Ask yourself, what are all the different types of data that we collect on a regular basis that we may or may not be doing things with, and how can we combine those to find intersections? How can we analyze them? This will also help you determine where there are gaps in the information you collect.
You Talk About the 4 Vs. Why Does Each of Them Matter, and How Do You Measure Each?
RS: Volume is the size of your entire data set, meaning everything coming in. It’s probably measured in gigabytes or terabytes, which is the scale of storage on disk.
Velocity is the rate at which the data is coming in, and it would be measured in units like records per second or bits per second, for example.
Variety means that you have a bunch of different pieces of information that you’re putting together to build a cohesive model around what you’re looking to solve or understand.
Veracity means that often, data is unclean and you have to deal with that. There’s no metric that I’m aware of to measure it, but it’s important.
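As a rough illustration of the two Vs that do have simple units, the sketch below estimates volume in bytes and velocity in records per second for a small batch of timestamped events. It is not from the interview: the function name, the JSON encoding used to size each record, and the sample events are all hypothetical, and a real pipeline would take these numbers from its storage and ingestion systems.

```python
import json


def measure_volume_and_velocity(records):
    """Estimate volume (bytes) and velocity (records per second).

    `records` is a list of (timestamp_in_seconds, payload_dict) pairs.
    Both the input shape and the JSON sizing are assumptions for this sketch.
    """
    if not records:
        return 0, None

    # Volume: total size of the payloads once serialized as UTF-8 JSON.
    total_bytes = sum(
        len(json.dumps(payload).encode("utf-8")) for _, payload in records
    )

    # Velocity: record count divided by the time span the batch covers.
    timestamps = [ts for ts, _ in records]
    elapsed = max(timestamps) - min(timestamps)
    records_per_second = len(records) / elapsed if elapsed > 0 else None

    return total_bytes, records_per_second


# Hypothetical website events: four records spread over four seconds.
sample_events = [
    (0.0, {"event": "click"}),
    (1.0, {"event": "view"}),
    (2.0, {"event": "click"}),
    (4.0, {"event": "purchase"}),
]

total_bytes, records_per_second = measure_volume_and_velocity(sample_events)
print(f"volume: {total_bytes} bytes, velocity: {records_per_second} records/s")
```

Variety and veracity do not reduce to a single number in the same way, which matches the point above that there is no standard metric for veracity.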
To That Point, What Makes Data Unclean?
RS: A good example is junk. Let’s say people are submitting email addresses. A lot of the time, those addresses have typos or misspellings, or they’re not real. Anytime you’re looking at things that are based on user input, there are often a lot of mistakes or just blatantly false information.
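A minimal sketch of that kind of cleanup for submitted email addresses might look like the following. It is an illustration only, not something described in the interview: the pattern is a deliberately loose syntax check, the function name and sample inputs are hypothetical, and a syntax check cannot tell whether an address actually exists.

```python
import re

# Loose syntax check: something@something.something, with no spaces or extra "@".
# This is an assumption for the sketch, not a full email address validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_emails(raw_emails):
    """Normalize addresses and split them into usable and rejected lists."""
    valid, rejected = [], []
    seen = set()
    for raw in raw_emails:
        # Normalize: trim surrounding whitespace and lowercase.
        candidate = raw.strip().lower()
        if EMAIL_PATTERN.match(candidate):
            # Keep only the first occurrence of each normalized address.
            if candidate not in seen:
                seen.add(candidate)
                valid.append(candidate)
        else:
            # Keep the original input so it can be reviewed later.
            rejected.append(raw)
    return valid, rejected


# Hypothetical user submissions, including a typo and a duplicate.
submissions = [" Alice@Example.com ", "bob@example", "not-an-email", "alice@example.com"]
valid, rejected = clean_emails(submissions)
print(valid)
print(rejected)
```

Keeping the rejected entries instead of discarding them makes it possible to see how much of the input is junk, which is one informal way to get a feel for veracity.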
How Would You Get Started in Big Data?
RS: You either need to have engineers and tools in-house, or you need to find a consultancy or firm that specializes in it. The latter can come in and help you get it set up and get you started, which is a good route.
There are some off-the-shelf platforms that can give you some insight, like GoodData and Tableau, where you can plug in the data sets that you have for a monthly fee. Their dashboard features help non-technical users create charts and graphs and look for trends to analyze.
Published at DZone with permission of Juan Koss. See the original article here.