Is Data Discovery Just a Buzzword?
A real data discovery architecture is one that gives non-technical end users the ability to own every step of the data analytics process. Everything else is just fluff.
Ever been irked by a buzzword? If you’re a data analyst, we’re betting you have.
The business analytics industry is notorious for its use of jargon. That's not a problem for those of us in the loop, but for business users who want to get in on the analytics action, the terminology alone can be a major barrier.
That’s why we’re vigilant about explaining BI in a way that all users can understand. Business analytics should be accessible for all your organization’s diverse stakeholders, but it’s easy to get overwhelmed by the unending stream of new phrases and terminology.
If you’re a business user who wants to capitalize on the huge opportunity presented by better business data analytics, read on. We’re about to give you the real deal when it comes to one term you’ve definitely seen a lot of: data discovery.
What Is Data Discovery?
Let’s start with a simple definition.
At its core, data discovery is the process of teasing out relevant data insights and delivering those insights to the business users who need them — a great proposition as more business users want to access and analyze their own data.
In the early days of BI, data analytics was reserved for technical and IT departments. Marketing managers, R&D heads, and any other type of business user had to rely on manual reports or templates that, as you can imagine, didn’t always give them the answers they needed.
Enter: Data Discovery Tools.
Data Discovery’s Rise to Fame
In 2008, Kurt Schlegel, a Research VP at Gartner, published a paper called “The Rise of Data Discovery Tools.” In it, he predicted big growth for data discovery tools, and indeed, by 2012, data discovery had become a multi-billion-dollar industry under the larger umbrella of BI.
As more diverse business users called for data visualizations they could access and digest quickly, the need for data discovery systems soared.
Suddenly, business users across the organization were able to get the answers (and internal approvals) they needed in a format that was easy to understand and act on. But it came at a price.
Traditional Data Discovery Requires Costly Data Prep
Most data discovery tools rely on resource-heavy data prep, forcing you to aggregate the data before you can visualize it.
This requires additional tools for cleansing, on top of a separate data warehouse, and usually at least one frustrated call to IT. The process is not only lengthy and expensive but also leaves a lot of room for error. Unfortunately, the same problem of a high-cost, highly fragmented setup also applies to data discovery visualization tools.
Most Data Discovery Visualization Tools Don’t Give You the Full Picture
If you’re like most organizations, you’re probably working with multiple systems like Google Analytics, SQL databases, and Excel. What you want is a single view of the data to help you prove your point or make the right call.
Many visualization tools can indeed combine multiple sources into a single table. But for this to happen, someone has to model the data first in order to get a correct analysis. And much like data prep, data modeling takes time and resources.
This would tempt many users into simply skipping the time-intensive data modeling step, but that would be a big mistake. Without the right data modeling, you’ll be looking at inaccurate data and defeating the purpose altogether.
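To make the risk concrete, here is a minimal sketch of what can go wrong when two sources are joined without modeling. The data, the table names (`orders`, `visits`), and both helper functions are hypothetical and exist only for illustration. Joining every order to every visit for the same customer repeats each order once per visit, so summed revenue is inflated.

```python
# Hypothetical data: two sources that share a customer key.
orders = [
    {"customer": "acme", "revenue": 100},
    {"customer": "acme", "revenue": 50},
    {"customer": "globex", "revenue": 200},
]
visits = [
    {"customer": "acme", "page": "pricing"},
    {"customer": "acme", "page": "docs"},
    {"customer": "acme", "page": "blog"},
    {"customer": "globex", "page": "pricing"},
]


def naive_join_revenue(orders, visits):
    """Join each order to every visit of the same customer, then sum revenue.

    Each order row is repeated once per matching visit, so revenue is
    counted multiple times.
    """
    return sum(
        o["revenue"]
        for o in orders
        for v in visits
        if o["customer"] == v["customer"]
    )


def modeled_revenue(orders, visits):
    """Reduce visits to one row per customer before joining, then sum revenue."""
    visit_counts = {}
    for v in visits:
        visit_counts[v["customer"]] = visit_counts.get(v["customer"], 0) + 1
    # Each order now matches at most one aggregated visit row.
    return sum(o["revenue"] for o in orders if o["customer"] in visit_counts)


true_revenue = sum(o["revenue"] for o in orders)
naive = naive_join_revenue(orders, visits)
modeled = modeled_revenue(orders, visits)
print(true_revenue, naive, modeled)
```

In this toy dataset the unmodeled join reports almost double the real revenue, while aggregating one side to the join key first keeps the total correct. Real tools handle this in different ways; the sketch only shows why the modeling step matters.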
So with all the cost and productivity barriers, why are data discovery tools still so popular?
What’s the Alternative?
Since most visual analytics platforms offer only half a solution (i.e., visualization with no or incomplete data prep and modeling), data discovery tools are still the most popular way to fill the gaps (not to mention boosting revenues for service providers).
In the past, investing in these tools made sense. It was the only way organizations could make their data available across the enterprise.
Thankfully, the BI landscape has evolved. Now, with a plethora of full-stack solutions and columnar-based tools, you can connect directly to raw data and join data sources into a single data model, or as we like to call it, a single version of the truth.
Real Data Discovery Is Full Exploration of Complex Data for All Users
A great BI tool is one that combines multiple disparate data sources in one user-friendly, easy-to-read, and more importantly, accurate data visualization through logical joins.
In other words, you get to mash up multiple data sources without messing up your analysis, which also solves the vast majority of modeling challenges before they even arise.
With a full-stack solution, there’s no need to spend time or money on upkeep because there’s no need to juggle multiple models or worry about manually cleaning, structuring, and updating data in a centralized data warehouse.
You can now do all of this visually, no coding required. A system with a focus on predictive analytics can even remember past updates and automate them to save you time in the future.
You Don’t Need Data Discovery to Get Greater Collaboration
Investing in a variety of data discovery tools actually creates more gaps within the organization and places a greater burden on IT and technical teams as they try to support multiple systems and users.
Compare that with a full stack solution and there’s no question about which one actually democratises data.
For example, a columnar-based solution combines different datasets and accesses insights from raw data. Business users can plug in data sources as and when they need them in no time.
Real Data Discovery Lets Non-Technical End Users Own Every Step of the Data
With a more modern, full-stack solution, every business user can take control of their data management. Even non-technical users can enter simple commands to run logic on data and make it as simple or complex as they need, all within a single environment.
Multiple people can access data without having to download the files to their PC, update the data, and then reload the server, as is the case with many data discovery tools. That long-standing and incredibly cumbersome process isn’t just time-intensive; it also requires a huge amount of RAM and CPU on each user’s machine, which can get very expensive, very quickly.
On the other hand, full-stack solutions with a central server let you easily add a file from your own machine and then make the changes on the remote server directly, giving you faster data syncing while using far fewer resources. You can even work with billions of rows of data.
Full stack solutions with a central server also resolve the issue of errors and discrepancies arising from multiple people accessing the same data simultaneously, because everything is synced in real time.
Full Stack Visualization Tools Give You the Full Picture
How many calculations do you want to see within one query or analysis? Most visualization tools make you summarize the data on two levels. But sometimes multiple calculations are what’s needed.
For example, if you want to compare the number of product units sold each month with the average sales per day, you need extra calculations: first the sum for every month, then that sum divided by the number of days in the month. Yet most of the time you are presented with only the monthly overview OR the daily breakdown.
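The two-step calculation described above (a monthly sum, then that sum divided by the number of days in the month) can be sketched in a few lines. This is only an illustration under assumptions: the sales figures and the `monthly_summary` function are hypothetical, and a BI tool would express the same logic as a formula rather than a script.

```python
import calendar
from collections import defaultdict
from datetime import date

# Hypothetical daily sales records: (date, units sold).
daily_sales = [
    (date(2023, 1, 5), 310),
    (date(2023, 1, 20), 310),
    (date(2023, 2, 3), 140),
    (date(2023, 2, 14), 140),
]


def monthly_summary(rows):
    """Return total units and average units per calendar day for each month."""
    totals = defaultdict(int)
    for day, units in rows:
        # Step 1: sum the units sold within each (year, month).
        totals[(day.year, day.month)] += units

    summary = {}
    for (year, month), total in sorted(totals.items()):
        # Step 2: divide the monthly sum by the number of days in that month.
        days_in_month = calendar.monthrange(year, month)[1]
        summary[(year, month)] = {
            "units": total,
            "avg_per_day": total / days_in_month,
        }
    return summary


summary = monthly_summary(daily_sales)
for (year, month), values in summary.items():
    print(year, month, values["units"], values["avg_per_day"])
```

Both figures come out of the same pass over the data, so the monthly total and the per-day average can sit side by side in one view.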
With a columnar-based tool, you can create complex custom formulas on the data that let you view multiple calculations simultaneously. You’re not limited to just two as with most data discovery tools.
You can also create your own dashboards and access them on any browser or mobile device using an on-site or cloud-based infrastructure, something that will be increasingly important as smaller companies aim to improve their BI.
Data Discovery Tools Can’t Keep Up With Growing Complexity
The need to scale cannot be ignored.
As Saar Bitner, VP Marketing at Sisense, puts it:
“Alongside the size of the data, today’s data is often very diverse in nature and is no longer confined to spreadsheets, with various automated systems generating large amounts of structured or semi-structured data, as for example could be the case in machine data, social network data, or data generating by the Internet of Things (IoT).”
You need a solution that can grow as you have more business users requiring data analysis — not one that will require still more solutions.
Hardware cost will be an issue for data discovery tools as the complexity of data increases.
With most data discovery solutions, all the data gets loaded in RAM. You need a whole lot of RAM and CPU to support this. Increasing capability quickly gets VERY expensive.
Compare that to an in-memory columnar database, which stores the data on disk and only uses RAM when a query is running, and the cost-benefit is clear.
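As a rough illustration of the idea, and not of how any particular product is implemented, the toy sketch below stores each column of a table in its own file on disk and loads only the one column a query needs. The table contents, the `.col` file naming, and the functions `write_columns` and `sum_column` are all assumptions made for this example.

```python
import os
import tempfile
from array import array

# Hypothetical table, already organized column by column.
table = {
    "units": [3, 5, 2, 7],
    "price_cents": [1999, 499, 2500, 999],
    "store_id": [1, 1, 2, 3],
}


def write_columns(directory, table):
    """Persist each column as its own binary file of 64-bit integers."""
    for name, values in table.items():
        with open(os.path.join(directory, name + ".col"), "wb") as f:
            array("q", values).tofile(f)


def sum_column(directory, name):
    """Load a single column from disk into memory and sum it.

    The other columns stay on disk and are never read.
    """
    values = array("q")
    with open(os.path.join(directory, name + ".col"), "rb") as f:
        values.frombytes(f.read())
    return sum(values)


with tempfile.TemporaryDirectory() as directory:
    write_columns(directory, table)
    # Only units.col is read to answer this query.
    total_units = sum_column(directory, "units")

print(total_units)
```

Because the query touches one column file, memory use depends on the columns involved in the query rather than on the size of the whole dataset, which is the cost argument made above.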
So, Is Data Discovery a Buzzword?
We think so.
A “real” data discovery architecture is one that gives non-technical end users (e.g., marketing managers and business analysts) the ability to own every step of the data analytics process, from preparing the data for analysis to visualizing the results, even when working with complex data from multiple sources.
Anything else is just fluff.
Published at DZone with permission of Shelby Blitz , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.