In-Memory Real-Time Analytics, Done Right


This post was originally written by Raj Bains

There is a lot of hype in the media about in-memory computing and in-memory real-time analytics. SAP HANA has used savvy marketing and experimental projects across its customer base to create a storm of interest followed by real revenue. Now everyone wants in. Oracle recently jumped in with an In-Memory option for Oracle 12c. VoltDB uses in-memory for transactions alone, with an emphasis on processing fast-moving data. ClustrixDB does both transactions and real-time analytics, and has chosen a combination of memory and SSDs that provides the right balance of performance, data durability and cost for our customers.

Let’s walk through what you need to consider when deciding if you need in-memory computing.

Memory: speed, cost and the correct choice for your application

Does your application need memory, SSD or HDD speed?

The most important question is: what latency does your application need? As an example, let’s take a very challenging application: bidding for online ads. The entire process of choosing and showing the ads has to be completed while the user is loading a webpage. Usually the bid has to be made within 100-120 milliseconds, and sometimes in as little as 50 milliseconds. While one would prefer the data to already be cached in memory for the best response time, let’s explore the worst-case scenario, where the data has to be read from storage. From SSD, a page can be read in about 50 microseconds, while from HDD it takes about 80 milliseconds. For most workloads, SSD strikes a good balance between latency and cost.
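To make the latency budget concrete, here is a minimal sketch that counts how many back-to-back page reads fit inside a bid deadline. It uses the worst-case figures quoted above; the function and constant names are illustrative, and it assumes reads happen one after another with no other work competing for the budget.

```python
# Worst-case page-read latencies quoted in the article, in microseconds.
SSD_PAGE_READ_US = 50        # about 50 microseconds
HDD_PAGE_READ_US = 80_000    # about 80 milliseconds


def max_sequential_reads(budget_ms: int, read_us: int) -> int:
    """Return how many back-to-back page reads fit inside the latency budget.

    Assumes reads are strictly sequential and nothing else uses the budget.
    """
    return (budget_ms * 1000) // read_us


if __name__ == "__main__":
    for budget_ms in (50, 100, 120):
        ssd = max_sequential_reads(budget_ms, SSD_PAGE_READ_US)
        hdd = max_sequential_reads(budget_ms, HDD_PAGE_READ_US)
        print(f"{budget_ms} ms budget: {ssd} SSD reads, {hdd} HDD reads")
```

Under these assumptions, a 50 millisecond budget leaves room for about a thousand SSD page reads but not even one HDD read.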

First Diagram


What does in-memory cost you?

No single cost figure is beyond debate or right for everyone. Our conclusion is that keeping all the data in memory is unacceptably expensive for most people, and we see this dynamic continuing for multiple years. Many clients still like spinning disks, but SSD prices have recently become acceptable to most, especially since the database size for most applications is typically between 500 GB and tens of terabytes.

Here is an AWS cost comparison for 2 TB of capacity (i.e. 1 TB of data, with two copies):

- 2 TB RAM = 8 × cr1.8xlarge = 8 × $3.50/hr = $28/hr (also more cores) = about $245,000/year

- 2 TB SSD = 1 × hi1.4xlarge = $1.60/hr = about $14,000/year
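The yearly figures follow from multiplying the hourly rate by the hours in a year. A small sketch of that arithmetic, using the hourly prices quoted above and assuming the instances run around the clock at on-demand rates (the function name is illustrative):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost(hourly_rate_usd: float, instance_count: int = 1) -> float:
    """Yearly cost of running instance_count instances 24x7 at a flat hourly rate."""
    return hourly_rate_usd * instance_count * HOURS_PER_YEAR


if __name__ == "__main__":
    ram_cost = annual_cost(3.50, instance_count=8)  # 8 x cr1.8xlarge
    ssd_cost = annual_cost(1.60)                    # 1 x hi1.4xlarge
    print(f"2 TB RAM: ${ram_cost:,.0f}/year")
    print(f"2 TB SSD: ${ssd_cost:,.0f}/year")
    print(f"RAM costs about {ram_cost / ssd_cost:.1f}x as much")
```

At these prices, the all-memory configuration costs roughly 17 times as much per year as the SSD configuration.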

What is the temperature of your data: hot or cold?

For most applications, 10-20% of the data is hot and accessed very frequently. The rest of the data is “cold”, i.e. accessed less frequently or rarely, if ever. Many in-memory databases insist on putting all your data into memory, and this is just not cost-effective for most customers.
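A rough sketch of why this matters for cost: if only the hot fraction needs to sit in memory while the full data set lives on SSD, the hourly bill drops sharply. The per-terabyte rates below are derived from the AWS example earlier ($28/hr and $1.60/hr for 2 TB each) and assume cost scales linearly with capacity, which real instance pricing only approximates. The names are illustrative.

```python
# Per-TB hourly rates derived from the article's 2 TB AWS example.
# Assumption: cost scales linearly with capacity.
RAM_USD_PER_TB_HOUR = 28.0 / 2
SSD_USD_PER_TB_HOUR = 1.6 / 2


def all_in_memory_hourly(total_tb: float) -> float:
    """Hourly cost when the entire data set is held in RAM."""
    return total_tb * RAM_USD_PER_TB_HOUR


def tiered_hourly(total_tb: float, hot_fraction: float) -> float:
    """Hourly cost when all data is on SSD and only the hot fraction is also in RAM."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    ram_cost = total_tb * hot_fraction * RAM_USD_PER_TB_HOUR
    ssd_cost = total_tb * SSD_USD_PER_TB_HOUR
    return ram_cost + ssd_cost


if __name__ == "__main__":
    for hot in (0.10, 0.20):
        print(f"hot={hot:.0%}: tiered ${tiered_hourly(2, hot):.2f}/hr "
              f"vs all-in-memory ${all_in_memory_hourly(2):.2f}/hr")
```

With 20% of a 2 TB data set hot, this estimate comes to about $7.20/hr against $28/hr for keeping everything in memory.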

System of record – or an add-on solution?

Real-time query speed or real-time data?

Many people confuse the fast query return time that in-memory or fast columnar databases provide with the freshness of the data being queried. While some streaming analytics solutions have the latest data, most data warehousing solutions are doing fast analytics on yesterday’s data. That is hardly real-time, since it does not support business models that need immediate feedback and insight into the business.

Second Diagram

Scale-out enables real-time analytics on system-of-record database

This is one of the least understood concepts. Legacy SQL databases are scale-up: you can buy a big iron box for millions of dollars to scale, but you keep it for transactions only and buy a second system for analytics. With a scale-out SQL solution such as ClustrixDB, you simply add commodity servers to scale. This means you can cheaply set up a cluster with tens or hundreds of cores that has enough resources for both transactions and real-time analytics. Many businesses will still need a data warehouse, but there is a huge opportunity in business models enabled by ad hoc real-time analytics on the primary database. This is especially true for SaaS offerings, where having a single database simplifies the environment, particularly when every customer might have different requirements.

Third Diagram

Data durability and recovery time

Another question to consider is whether your in-memory database provides the durability required to be your system of record. Failures in large distributed systems are very common. A loss event can mean a lost node, power failure in a region or a crash due to a database bug that happens in all copies of the database. Loss events often take all the nodes of the database down.

The data durability provided by many in-memory solutions comes with an unacceptably high recovery time on failure: all the data has to be loaded back into memory before the database is available again. In a world where the application is doing tens or hundreds of thousands of transactions per second, this is just not acceptable for a system-of-record database.
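As a back-of-the-envelope sketch, recovery time for a pure in-memory database is roughly the data size divided by the rate at which data can be reloaded. The 1 GB/s reload rate below is a hypothetical figure chosen for illustration, not a measurement of any product, and the function names are likewise illustrative.

```python
def reload_seconds(data_gb: float, load_gb_per_s: float) -> float:
    """Time to load the full data set back into memory at a constant rate."""
    if load_gb_per_s <= 0:
        raise ValueError("load_gb_per_s must be positive")
    return data_gb / load_gb_per_s


def transactions_stalled(downtime_s: float, tps: int) -> int:
    """Transactions that cannot be served while the database is unavailable."""
    return int(downtime_s * tps)


if __name__ == "__main__":
    # Hypothetical scenario: 1 TB of data reloaded at 1 GB/s, 100,000 TPS workload.
    downtime = reload_seconds(data_gb=1000, load_gb_per_s=1.0)
    stalled = transactions_stalled(downtime, tps=100_000)
    print(f"Reload takes about {downtime / 60:.1f} minutes")
    print(f"Roughly {stalled:,} transactions stalled in that window")
```

Even with these generous assumptions, the database stays unavailable for many minutes, during which a busy application would have to hold back or drop a very large number of transactions.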

Replacing ETL with a replicated analytics slave

Some ClustrixDB customers prefer to have a separate system for analytics, especially when there is concern about inadvertently adding analytics queries that may impact the transactional side. A master-slave setup here can provide a slave that is near real-time with an exact copy of the master database.

Fourth Diagram

ClustrixDB: Scale-out architecture with SSD backed memory

At ClustrixDB, we believe that SSD backed memory is the right way to do in-memory analytics:

1) Hot data lives in memory, cold data a few microseconds away in SSDs

2) SSDs provide the durability to be system-of-record

3) Scale-out allows running real-time analytics on system-of-record database

In addition to striking the right balance of memory and SSD, ClustrixDB software uses a combination of multi-version concurrency control (MVCC) and massively parallel processing (MPP) for faster analytics.

With ClustrixDB:

1) Analytics run on latest real-time data

2) Analytics run at fast real-time query speed

2.1) ClustrixDB MVCC means reads and writes don’t interfere

2.2) ClustrixDB MPP uses multiple cores across multiple nodes to make each analytic query go fast.

3) Shared nothing architecture means that total memory is the sum of memory of all nodes. So as you add nodes, more of your data lives in memory.
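The last point can be sketched with simple arithmetic: in a shared-nothing cluster the usable memory is the sum across nodes, so the fraction of data that fits in memory grows with node count. The node size and data size below are hypothetical, and the sketch ignores replica copies and memory used by the database itself.

```python
def in_memory_fraction(node_count: int, mem_per_node_gb: float, data_gb: float) -> float:
    """Fraction of the data set that fits in the cluster's combined memory.

    Simplification: ignores replicas and per-node overhead.
    """
    if data_gb <= 0:
        raise ValueError("data_gb must be positive")
    total_memory_gb = node_count * mem_per_node_gb  # sum of memory of all nodes
    return min(1.0, total_memory_gb / data_gb)


if __name__ == "__main__":
    # Hypothetical cluster: 128 GB per node, 1,000 GB of data.
    for nodes in (3, 4, 6, 8):
        frac = in_memory_fraction(nodes, mem_per_node_gb=128, data_gb=1000)
        print(f"{nodes} nodes: {frac:.0%} of data fits in memory")
```

In this example, three nodes hold about 38% of the data in memory, comfortably above a 10-20% hot set, and eight nodes hold all of it.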

ClustrixDB has designed in-memory real-time analytics to be done right, offering the right combination of cost, performance and data durability for the vast majority of customers.


Published at DZone with permission of Lisa Schultz, DZone MVB. See the original article here.
