How to Streamline Query Times to Handle Billions of Records
How to Streamline Query Times to Handle Billions of Records
So you have a lot of data you need to get through, and your system is starting to feel the strain. How can you ease the pain and make those queries run even faster?
Join the DZone community and get the full member experience.Join For Free
Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.
Here at Sisense, we love a challenge. So when a client comes to us and tells us they need to find a way to run queries on billions of records without this slowing them down, our ears perk up and we leap at the chance to find a solution.
In fact, that’s how we recently found ourselves testing a billion transactional records and three million dimensional records — totaling a whopping 500GB of data — with 100 concurrent users and up to 38 concurrent queries, with a total setup time of just two hours… and an average query time of 0.1 seconds!
But wait! I’m getting ahead of myself. Let’s start by talking through some of the issues that affect how fast you can query data.
How Are You Storing Your Data?
Let’s start with the obvious: data warehousing.
Typically, working with masses of data means you also need extensive data warehousing in place to handle it alongside extract-transform-load tools that uploads data from the original source on a regular basis (extract), adjusts formats and resolve conflicts to make the datasets compatible (transform), and then delivers all of this data into the analytical repository where it’s ready for you to run queries, calculations, and trend analysis (load).
This creates a single version of the truth — a source of data that brings together all your disparate pieces into one place. While this is great, there are also some drawbacks to data warehousing.
First of all, data warehouses are highly structured, and the row-and-column schema can be overly restrictive for some forms of data. Also, the sheer volume of data quickly overloads most systems, grinding to a halt if you run queries that attempt to tap into the entire data pool.
Then, there are data marts.
To help tackle the issues that come with working with huge data sets, many IT teams deploy data marts alongside their databases. These essentially siphon off access to a smaller chunk of the data, and then you select which data marts each department or user has access to. The outcome of this is that you put less pressure on your hardware, as your computer is tapping into smaller pools of data; but the flipside is that you have vastly reduced access to the organization’s total data assets in the first place.
At the other end of the scale, you have data lakes.
These are used to store massive amounts of unstructured data, helping to bypass some of the issues that come with using conventional data warehouses. They also make sandboxing easier, allowing you to try out different data models and transformations before you settle on a final schema for your data warehouse to avoid getting trapped into something that doesn’t work for you.
The trouble with data lakes is that, while the offer formidable capacity for storing data, you do need to have all kinds of tools in place to interface between the data lake and your data warehouse, or with your end data analytics tool if you want to skip the need for warehousing on top. Systems like this that use data lakes aren’t exactly agile, so your IT team will need to be pretty heavily involved in order to extract the insights you want.
Alternatively, you might deal with unstructured data using an unconventional data storage option.
For example, you might use a NoSQL database like MongoDB. This gives you tons of freedom in terms of the kind of data you add and store, and the way that you choose to store it. MongoDB also makes use of sharding techniques to avoid piling the pressure on your IT infrastructure, allowing for (pretty much) infinite scaling.
The downside, of course, is that the thing that makes this so great — the unstructured, NoSQL architecture — also makes it tricky to feed this data straight into a reporting tool or analytics platform. You need a way to clean up and reconcile the data first.
What About Tools Used for Analysis?
Dynamic DBMS tools like PostgreSQL can open doors.
PostgreSQL is an analytics and reporting tool that allows you to work with an enormous variety of data types — including native data types that give you much more freedom as you come to build and manipulate a BI solution, and “array” types, which help you to aggregate query results rapidly on an ad hoc basis.
Introducing PostgreSQL into the mix can be massively helpful in bringing together your disparate strands — but again, it can’t do everything. It can’t help much with qualitative data, and as a non-relational database (which wasn’t built to handle Big Data) it will buckle under huge volumes of information.
You can also use R for high-end predictive analytics. Once you have a solid BI system in place, you can add another layer of awesomeness by using R to build working models for statistical analysis quickly and easily. R is incredibly versatile and allows you to move away from static reporting by programming a system for analysis that you can adapt and improve as you go.
The thing is, though, this is an add-on; it doesn’t replace your current BI or data analytics system. R is an excellent programming language that can help you generate predictive analytics fast, but you need to have a rock-solid system in place for handling and preparing data in the first place.
How to Streamline Everything
I know what you’re thinking: I said I was going to explain how to streamline your data queries to help you generate results faster, but so far, all I’ve done is dangle some potential solutions and then show you how they fall short!
That’s because I haven’t revealed the secret sauce that binds all these pieces together in perfect harmony.
As you can see, each of the tools we’ve discussed is used to fix one problem in the storage, flow, and use of data within your organization, but they don’t help with the big picture. That’s where Sisense’s Elasticube comes in. The Elasticube allows you to store data or drag it in directly from your existing stores at lightning speed, giving users unfettered access to their entire pool of data, whatever format it’s kept in (unless you choose to stagger permissions). Thanks to clever use of In-Chip Processing and a Columnar Database structure, you tap into only the data you need for the query, without restricting yourself permanently, as you would with a data mart.
You can then reconcile and harmonize this data with minimal hassle to treat all these strands as a single data source for the purpose of analysis and reporting.
Still within the Elasticube, you can map and manipulate these data sources to build your own dashboards and run your own queries at incredible speed.
Plus, using our range of custom-built connectors, you can link your Sisense Elasticube directly to MongoDB, PostgreSQL, and other DMBS tools, and you can integrate Sisense with R for even more in-depth predictive analytics.
So that’s the big secret. Using the Sisense Elasticube, I was able to set up a system in 120 minutes that could run concurrent queries on data representing one billion online purchases, from three million origins/destinations, with an average query time of 0.1 seconds and a maximum query time of just 3 seconds.
Pretty impressive, huh? Here’s what it looked like:
And here’s an example dashboard that we used to display the results in real time:
How’s that for streamlined?
Published at DZone with permission of Kevin Ruppel , DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.