Over a million developers have joined DZone.

How to do real time complex query on Big Data

DZone's Guide to

How to do real time complex query on Big Data

· Big Data Zone ·
Free Resource

The open source HPCC Systems platform is a proven, easy to use solution for managing data at scale. Visit our Easy Guide to learn more about this completely free platform, test drive some code in the online Playground, and get started today.

Originally written by Ron Zavner.

In the past few years, almost every function of our lives has become dependent on real time applications. Whether it is updating our friends on every move we make via social media, shopping on e-commerce websites, expecting instant results from finanicial and governmental websites; we have become completely dependent on getting the correct information quickly.

What we often fail to realize is that there are a few challenges here:

  • The vast amount of data that flows in these systems
  • The need for a highly available application and data store.
  • High performance requirements.
  • Support for complex querying.
  • Transactional support
  • I think that we can try to divide these challenges into 3 parts: big data, real time and complex querying.

    First challenge – Big Data

    Now if we start with big data, there are plenty of solutions we can utilize to solve these problems. The most popular solutions would be mongoDB, Cassandra and Hadoop. All of these solutions are distributed environments which have multiple partitions that contain our data. They also replicate partitions to ensure we can serve the data from another machine if one machine is down (most are eventual consistent which means that the replica might not have the most recent update of the data but that’s for another discussion). Then if we take these NoSQL databases we can easily overcome this challenge with the amount of data as well as the issue with high availability. It is also a scalable solution since we can add more compute and storage resources which results in being able to support more data and throughput.

    Second challenge – real time

    Real time is our main challenge. The solutions we provided here are mostly disk based which means we don’t have support for the real time part as complex queries can take minutes and sometimes even more. This is why we turn to IMDG (in memory data grid) which stores some or all of the data in memory. When data is stored in memory, calculations can be done extremely fast using RAM and not I/O access.

    But this solution is not so easy either. We can store maybe a few TB of data in RAM but what happens if we have more than that? Say 50TB … even though RAM is much cheaper these days, 50TB would be very expensive. Moreover, that becomes too many machines to manage in one data grid cluster. Some of the IMDG solutions offer a way to store some of the data in memory and some of it in the disk (or DB) but then again we have the same problem of slow I/O access to disk which results in “not so real time query”.

    Fortunately, SSD can provide us a great combination of both– if we use it right. Although SSD is not as fast as RAM, it is much faster than the normal disk and yet much cheaper than RAM. Now there are 2 ways we can utilize SSD for very large clusters and real time complex querying:

  • Fast index mode – we store inRAM the fields that we query on and the rest of the fields are stored on SSD. If for example we have a big object with many fields, we can store only a few of them inRAM (indexed) and the less important fields on SSD so we can still query them very fast relatively to regular disk.
  • Last recently used – the most recent used objects will be stored in RAM and the others will be evicted to SSD. This approach still might have the challenge of real time since the query engine needs to work with SSD and not with RAM for complex calculation that require all data.
    So essentially, SSD let us enjoy both worlds – very fast access to data using RAM for the “important” fields and a memory extension to have real support for big data which we can’t have with a basic IMDG.

  • Third challenge – complex querying

    We’re still left with the complex querying part- most applications still have the need for real time analytics that in relational DBs we can easily implement with aggregations (avg, min, max, sum, group by and so on. In distributed environments it is much more complex since the data is partitioned across the cluster and to aggregate it means – we either need to bring all the data to the client (not an option since it is too much) orusemap reduce logic. Using map reduce logic is an okay solution but much less intuitive than a simple SQL statement with group by.

    A potential solution – XAP

    XAP is an in memory data grid provided by Gigaspaces that can help solve all of the challenges stated above. In XAP version 10, there is support for SSD integration as stated above and complex querying. It also offers, high availability, scalability, transaction support, SQL queries (including nested querying).

    Let’s say we have a table of 100M employees which is too heavy too large a dataset for in memory and we want to run complex query in real time. We could store the employee IDs, department IDs, salary and age in RAM and the rest of the fields (address, notes, records, previous salaries etc in the SSD). The advantage of the SSD integration is that XAP really leverages the SSD power by working with its native API and storing the data as a key value store while other integrations are using SSD just as a fast disk. While when you use SSD just as a fast disk you do gain some performance impact, the performance impact is smaller than working with the SSD API because it still has to go through some hoops of a disk based solution. XAP would just store the object payload in the SSD using key value utilizing the API.
    Now we want to get the average, min and max salary grouped by department and sex type only for employees which are older than 50.
    In SQL it would look like:

    select AVG(salary), MIN(salary), MAX(salary) from Employees WHERE age > 50 group by Department, SexType

    In XAP it would look like:

    query = new SQLQuery(Person.class, “age > ?”, 50);
    groupByResult = groupBy(gigaSpace, query, new GroupByAggregator()
    .groupBy(“department”, “sexType”)

    To conclude, the need for real time complex querying on big data is out there. We can use SSD to store payload of objects while the indexes will be stored inRAM for ultra fast access – that is how we can combine a real in memory data grid with the native support for big data. The next challenge we need to face is how to run (intuitively) complex querying that work on nested fields or aggregations on that huge set of data. I showed a quick example here how it can be done with XAP and I’d be happy to see more examples if you can share.

    Originally written by Ron Zavner.

    Managing data at scale doesn’t have to be hard. Find out how the completely free, open source HPCC Systems platform makes it easier to update, easier to program, easier to integrate data, and easier to manage clusters. Download and get started today.


    Opinions expressed by DZone contributors are their own.

    {{ parent.title || parent.header.title}}

    {{ parent.tldr }}

    {{ parent.urlSource.name }}