How to do real time complex query on Big Data
Join the DZone community and get the full member experience.Join For Free
Originally written by Ron Zavner.
In the past few years, almost every function of our lives has become dependent on real time applications. Whether it is updating our friends on every move we make via social media, shopping on e-commerce websites, expecting instant results from finanicial and governmental websites; we have become completely dependent on getting the correct information quickly.
What we often fail to realize is that there are a few challenges here:
I think that we can try to divide these challenges into 3 parts: big data, real time and complex querying.
First challenge – Big Data
Now if we start with big data, there are plenty of solutions we can utilize to solve these problems. The most popular solutions would be mongoDB, Cassandra and Hadoop. All of these solutions are distributed environments which have multiple partitions that contain our data. They also replicate partitions to ensure we can serve the data from another machine if one machine is down (most are eventual consistent which means that the replica might not have the most recent update of the data but that’s for another discussion). Then if we take these NoSQL databases we can easily overcome this challenge with the amount of data as well as the issue with high availability. It is also a scalable solution since we can add more compute and storage resources which results in being able to support more data and throughput.
Second challenge – real time
Real time is our main challenge. The solutions we provided here are mostly disk based which means we don’t have support for the real time part as complex queries can take minutes and sometimes even more. This is why we turn to IMDG (in memory data grid) which stores some or all of the data in memory. When data is stored in memory, calculations can be done extremely fast using RAM and not I/O access.
But this solution is not so easy either. We can store maybe a few TB of data in RAM but what happens if we have more than that? Say 50TB … even though RAM is much cheaper these days, 50TB would be very expensive. Moreover, that becomes too many machines to manage in one data grid cluster. Some of the IMDG solutions offer a way to store some of the data in memory and some of it in the disk (or DB) but then again we have the same problem of slow I/O access to disk which results in “not so real time query”.
Fortunately, SSD can provide us a great combination of both– if we use it right. Although SSD is not as fast as RAM, it is much faster than the normal disk and yet much cheaper than RAM. Now there are 2 ways we can utilize SSD for very large clusters and real time complex querying:
So essentially, SSD let us enjoy both worlds – very fast access to data using RAM for the “important” fields and a memory extension to have real support for big data which we can’t have with a basic IMDG.
Third challenge – complex querying
We’re still left with the complex querying part- most applications still have the need for real time analytics that in relational DBs we can easily implement with aggregations (avg, min, max, sum, group by and so on. In distributed environments it is much more complex since the data is partitioned across the cluster and to aggregate it means – we either need to bring all the data to the client (not an option since it is too much) orusemap reduce logic. Using map reduce logic is an okay solution but much less intuitive than a simple SQL statement with group by.
A potential solution – XAP
XAP is an in memory data grid provided by Gigaspaces that can help solve all of the challenges stated above. In XAP version 10, there is support for SSD integration as stated above and complex querying. It also offers, high availability, scalability, transaction support, SQL queries (including nested querying).
Let’s say we have a table of 100M employees which is too heavy too large a dataset for in memory and we want to run complex query in real time. We could store the employee IDs, department IDs, salary and age in RAM and the rest of the fields (address, notes, records, previous salaries etc in the SSD). The advantage of the SSD integration is that XAP really leverages the SSD power by working with its native API and storing the data as a key value store while other integrations are using SSD just as a fast disk. While when you use SSD just as a fast disk you do gain some performance impact, the performance impact is smaller than working with the SSD API because it still has to go through some hoops of a disk based solution. XAP would just store the object payload in the SSD using key value utilizing the API.
Now we want to get the average, min and max salary grouped by department and sex type only for employees which are older than 50.
In SQL it would look like:
select AVG(salary), MIN(salary), MAX(salary) from Employees WHERE age > 50 group by Department, SexType
In XAP it would look like:
query = new SQLQuery(Person.class, “age > ?”, 50); groupByResult = groupBy(gigaSpace, query, new GroupByAggregator() .groupBy(“department”, “sexType”) .selectAverage(“salary”) .selectMinValue(“salary”) .selectMaxValue(“salary”));
To conclude, the need for real time complex querying on big data is out there. We can use SSD to store payload of objects while the indexes will be stored inRAM for ultra fast access – that is how we can combine a real in memory data grid with the native support for big data. The next challenge we need to face is how to run (intuitively) complex querying that work on nested fields or aggregations on that huge set of data. I showed a quick example here how it can be done with XAP and I’d be happy to see more examples if you can share.
Originally written by Ron Zavner.
Opinions expressed by DZone contributors are their own.