Wake Up From the Big Data Nightmare
The dark side of Big Data: what everyone forgets to mention.
If you don’t actually work with Big Data, and you only know about it from what you hear in the media — how it can be used to optimize traffic flows, make financial trade decisions, foil terrorist plots, make devices smarter and self-operating, and even track athletic performance — you’ll probably say it’s a dream come true.
However, for those who actually extract, analyze, and manage Big Data so it can do all those wondrous things, it’s often nothing but a nightmare.
Mining Big Data: All Drudgery and Grunt Work?
Working with high-volume, fast-changing data streams can be quite mind-boggling and definitely more complex than just looking at spreadsheets, tables, and dashboards.
Did you know that 2.5 quintillion bytes of data are generated each day, a rate that keeps accelerating with advances in IoT? With such unbelievable amounts of information, how do you keep up?
Say you wanted to use Big Data to answer a simple question, such as, “How many users have logged into our company’s online app in the last hour?” It’s very easy to answer this if you only have a few hundred users, but what if you are talking about a common app used by millions of people?
If you think getting the answer will be as easy as one-two-three, well, let’s see if you’re right.
The supposedly easy process actually involves:
Storing all raw data in a repository (a “data lake”). This step requires knowing and adhering to best practices for compression, partitioning, and naming rules if you want to keep all that information intact and usable in the future.
Writing code to simply understand the data that you’ve collected.
Writing more code “blindly,” since the data is not visually depicted, for Extract, Transform, Load (ETL) jobs, each of which takes days to write and hours to run.
Ensuring that ETL jobs run efficiently by assigning a developer to manage and control the orchestration system (e.g., Apache Airflow or NiFi).
Creating a NoSQL database to manage the state of stateful ETL jobs.
Managing an integrated analytics database — such as Amazon Redshift — to use in executing SQL queries.
And, finally, almost a year and thousands of developer hours after you first asked your supposedly simple question, getting the answer.
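To make the first step a little more concrete, here is a minimal Python sketch of what "compression, partitioning, and naming rules" can look like in practice. It assumes a local folder standing in for the data lake, a Hive-style `key=value` directory layout, and gzip-compressed JSON lines. The names `partition_path`, `write_events`, `login_events`, and the `ts` and `user_id` fields are all hypothetical, chosen only for illustration.

```python
import gzip
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def partition_path(root: Path, source: str, ts: datetime) -> Path:
    """Build a Hive-style partition directory (key=value) for an event time."""
    return (
        root
        / source
        / f"year={ts:%Y}"
        / f"month={ts:%m}"
        / f"day={ts:%d}"
        / f"hour={ts:%H}"
    )


def write_events(root: Path, source: str, events: list[dict]) -> list[Path]:
    """Group events by UTC hour and write each group as gzip-compressed JSON lines."""
    groups: dict[Path, list[dict]] = {}
    for event in events:
        # Assumption: each event carries a "ts" field in epoch seconds.
        ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
        groups.setdefault(partition_path(root, source, ts), []).append(event)

    written = []
    for directory, rows in groups.items():
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / "part-0000.json.gz"
        with gzip.open(target, "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        written.append(target)
    return written


if __name__ == "__main__":
    sample = [
        {"user_id": "a", "ts": 0},
        {"user_id": "b", "ts": 10},
        {"user_id": "c", "ts": 3600},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_events(Path(tmp), "login_events", sample):
            print(path.relative_to(tmp))
```

With a layout like this, a question about "the last hour" only has to read one or two hourly folders instead of scanning everything. A real data lake would add much more (object storage, columnar formats, file compaction), which is part of why this step alone takes real expertise.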
After all this, you are NOT done because the next step is REPEAT.
Yes, repeat the difficult, complex process. In addition to spending hundreds of thousands of dollars more on software, storage, and manpower to make sure all the code-heavy parts of the process work well together, you will have to go through this process each time you need another business question answered or a new data source added.
At least you don’t have to lather and rinse first, because somewhere in there are devices that can’t get wet.
Mining Big Data vs. Small Data? No Contest
No doubt someone who’s working with “small” data, e.g., ERP and financial data, has it easy. The recipe is simple: Acquire a database, hit it with some SQL queries and a dashboard — and you’ve got something you can use.
There are no tough, code-heavy processes and no clunky, unwieldy architecture, just GUI-based tools, and anyone with elementary SQL knowledge can access and use business data to answer reasonably simple questions.
Beating Big Data Complexity to Wake Up from the Nightmare
If the goal is to simplify Big Data and reduce the time and resources it takes to convert raw data streams into useful and usable information, then you need to tackle the problem from a very different perspective.
CUT the number of systems needed to transform data into a workable form. Who says you have to have three separate open-source frameworks for data cataloging, integration, and serving? Instead, build a system that can be applied to common use cases in big and streaming data analytics.
VISUALIZE the data. You’re wasting time if you’re writing code with only a hazy understanding of what the actual schema or architecture is. What if you had a visual catalog that immediately provided a picture of the data structure, including stats on distinct values, value distribution, and how often each value occurs in the full data set?
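The statistics such a catalog would surface are not exotic. As a rough sketch, assuming the records are already available as a list of Python dictionaries, a per-column profile could be computed like this. The function name `profile_column` and the sample `country` field are hypothetical.

```python
from collections import Counter


def profile_column(rows: list[dict], column: str, top_n: int = 3) -> dict:
    """Summarize one column: row count, missing values, distinct values, top values."""
    values = [row.get(column) for row in rows]
    present = [value for value in values if value is not None]
    counts = Counter(present)
    total = len(values)
    return {
        "rows": total,
        "missing": total - len(present),
        "distinct": len(counts),
        # Each entry is (value, occurrences, share of all rows).
        "top_values": [
            (value, n, n / total) for value, n in counts.most_common(top_n)
        ],
    }


if __name__ == "__main__":
    sample = [
        {"country": "US"},
        {"country": "US"},
        {"country": "DE"},
        {"country": None},
        {},
    ]
    print(profile_column(sample, "country"))
```

The hard part at Big Data scale is not the arithmetic but computing it continuously over streaming data and showing it before anyone writes an ETL job, which is the argument for having it built into the platform.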
AUTOMATE code-heavy procedures. Working successfully with Big Data largely comes down to adopting best practices in storage, partitioning, and SQL operations. A solution with these best practices already built in will optimize performance and minimize costs.
No, such a solution will STILL not make Big Data as easy as Excel. BUT, you can now say goodbye to huge and expensive data engineering teams, months-long data projects just to answer simple analytical queries, too much time spent on infrastructure, and other Big Data nightmares.