Why Big Data Should Also Be Smart
In this article, see why big data is not always useful and why it should also be smart data.
Initially, many Data Scientists and specialists from related professions are very enthusiastic about Big Data. A couple of years later, most of them become much more skeptical, and the term Big Data itself turns into a buzzword. Why is Big Data not valuable in itself, and why does the quality of even the necessary data matter more than its quantity?
Big Data Isn't Always Useful
Quite often, Big Data is perceived as a kind of treasure, a valuable resource that makes it possible to create effective strategies, optimize processes, and so on. However, the more experienced analysts become, the more specific the questions they ask. What exactly can we learn from this data? Do we need this information now? How much will it cost to store the data if we don't need it now?
Working with Big Data requires considerable computing power. With the development of cloud storage, computing power has become more affordable, but maintaining it still requires resources. The same data can be very valuable to one company and completely useless to another. In the latter case, the data becomes nothing but a liability. To avoid this, analyze how useful the data will be before collecting it and sending it to the repository.
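The question of whether data is worth storing can be turned into simple arithmetic. The sketch below is only an illustration: the function names, the price per gigabyte, and the expected value are all hypothetical figures, not numbers from any real provider or study.

```python
def monthly_storage_cost(size_gb: float, price_per_gb_month: float) -> float:
    """Cost of keeping `size_gb` gigabytes in storage for one month."""
    return size_gb * price_per_gb_month


def worth_collecting(size_gb: float, price_per_gb_month: float,
                     months: int, expected_value: float) -> bool:
    """Return True if the expected value of the data exceeds its storage cost.

    Only storage is counted here; processing and maintenance costs
    would make the comparison even stricter.
    """
    total_cost = monthly_storage_cost(size_gb, price_per_gb_month) * months
    return expected_value > total_cost


# Hypothetical figures, chosen only for illustration.
clickstream = dict(size_gb=50_000, price_per_gb_month=0.02,
                   months=24, expected_value=10_000)

# Storage alone costs 24,000 over two years, more than the expected 10,000.
print(worth_collecting(**clickstream))  # False
```

A check like this is crude, but running it before the data pipeline is built forces the question "what is this data worth to us?" to be answered with a number.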
The vast majority of all Big Data in the world is currently garbage. This data is either completely useless for those who currently own it or it contains so little useful information that it doesn’t even cover the cost of its processing. According to a Forrester study, companies actually use no more than 40% of the data they collect.
Bigger Is Not Always Better
The “just throw as much data into AI as possible” tactic no longer works. Data Scientists understand that not every feature is useful, and that the quality of data is more important than the quantity. Only the data that helps to analyze what is important at the given moment is needed. Only by working with quality data can AI give useful results.
Along with the data itself, there is a need for infrastructure to safely analyze, use, and transfer data, and to separate useful information from garbage. Not everyone has realized this yet, but data should be not only big but also smart.
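One simple way to start separating useful information from garbage is to drop columns that cannot carry any signal. The following is a minimal sketch, assuming records arrive as a list of dictionaries; the function name, the `max_missing_ratio` threshold, and the sample records are hypothetical.

```python
def informative_columns(rows, max_missing_ratio=0.5):
    """Return the names of columns that vary and are not mostly missing.

    `rows` is a list of dicts with hashable values; None counts as missing.
    """
    if not rows:
        return []
    columns = sorted({key for row in rows for key in row})
    keep = []
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(rows)
        if missing_ratio > max_missing_ratio:
            continue  # mostly empty: too little information to be useful
        if len(set(present)) <= 1:
            continue  # constant value: carries no information
        keep.append(col)
    return keep


# Hypothetical sample records.
rows = [
    {"user_id": 1, "country": "DE", "app_version": "1.0", "referrer": None},
    {"user_id": 2, "country": "FR", "app_version": "1.0", "referrer": None},
    {"user_id": 3, "country": "DE", "app_version": "1.0", "referrer": "ad"},
]

print(informative_columns(rows))  # ['country', 'user_id']
```

This only catches the most obvious garbage. A column that varies, such as an identifier, can still be useless for a given question, so a check like this complements rather than replaces a decision about what the business actually needs.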
Why Data Should Be Smart
Big Data has five key parameters: volume, velocity, variety, veracity, and value.
The value of data does not always depend on its volume or velocity, but it does depend on the other parameters. If the data is not varied, not accurate, and not valuable at the moment, there is no point in collecting it.
The Wired portal defines Smart Data as follows:
“Smart Data” means information that actually makes sense. It is the difference between seeing a long list of numbers referring to weekly sales vs. identifying the peaks and troughs in sales volume over time.
In practice, Smart Data is data that can be used at a given moment to meet the specific needs of the company. Smart Data is also the part of Big Data that ends up in presentations and on which decisions are based.
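The difference described in the Wired quote, between a long list of weekly sales numbers and the peaks and troughs in them, can be sketched in a few lines. This is a minimal illustration with a hypothetical function name and made-up sales figures; real analyses would usually smooth the series first.

```python
def peaks_and_troughs(sales):
    """Return (peaks, troughs) as lists of week indices.

    A peak is a week with sales strictly higher than both neighbours;
    a trough is strictly lower than both. The first and last weeks
    are skipped because they have only one neighbour.
    """
    peaks, troughs = [], []
    for i in range(1, len(sales) - 1):
        prev, cur, nxt = sales[i - 1], sales[i], sales[i + 1]
        if cur > prev and cur > nxt:
            peaks.append(i)
        elif cur < prev and cur < nxt:
            troughs.append(i)
    return peaks, troughs


# Made-up weekly sales volumes, for illustration only.
weekly_sales = [120, 135, 160, 140, 110, 125, 150, 170, 165]

peaks, troughs = peaks_and_troughs(weekly_sales)
print(peaks)    # [2, 7]
print(troughs)  # [4]
```

Nine numbers become three facts that someone can act on: sales peaked in weeks 2 and 7 and bottomed out in week 4 (counting from zero).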
Why Non-Smart Data Is Useless and Even Destructive
Imagine that two Data Scientists are working on the implementation of Big Data and Machine Learning tools in the companies they work at, but they choose different approaches. One of them uses off-the-shelf tools to save time and immediately starts collecting data. This specialist transfers everything they have collected into the data infrastructure and uses ML algorithms to optimize the result.
The second specialist wants more control over the data structure, so they start writing their own modules. This takes a lot of time, but in the end, the specialist receives more compact and accurate data. Their company saves thousands of dollars by not storing terabytes of unnecessary information, yet it still has as much useful data as the company employing the first specialist. This money can be reinvested into creating new modules for better results.
Companies are already trying to organize the process in such a way as to reduce the collection of unnecessary data, but still, their algorithms continue processing tons of garbage. Without useful content, data remains a liability that requires additional resources to process. Focusing on Smart Data may be the solution, but this will be just the beginning of the transition to the right data techniques.
Zhann Chubukov, Head of Data Science, Andersen:
The professional community is about to come to the reasonable and logical conclusion that Big Data is just a buzzword that gobbles money and delivers low returns. Hence before building a Data Lake and a Data Warehouse, it is necessary to figure out the business problems that these things will have to solve so that the data is not only big but also reliable and smart. Collecting data is not a goal in itself; the goal is to make money from this data while simultaneously reducing operating costs and minimizing "warehouses" (data stores).
Published at DZone with permission of Pavel Svirsky. See the original article here.