Use a Fast Data Sink, Not a Lambda Architecture for Real-Time Analytics

When working with big data sets to do real-time analyses, data sinks have proven to work better than Lambda architectures. Read on to find out why.

Ah, the Lambda architecture. So data’s getting faster and streaming in real-time? Great! Oh, your database can’t accept a continuous stream of INSERTs and also respond to SELECTs from users at the same time at high scale? Behold the Lambda Architecture:

[Figure: Lambda architecture diagram]

By Textractor - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=34963985

Basically, the idea is to keep the fast stuff fast and the slow stuff slow. I wrote a paper 14 years ago on the challenges of real-time data warehousing. Fortunately, the data streaming, database, and BI layers have all evolved significantly since then, and there are now databases and other data storage engines that support the trinity of features needed to do both real-time and historical analytics right, without a Lambda architecture:

  1. Accept real-time streams of data at high rates.
  2. Simultaneously respond to large volumes of queries, including on the most recently added data.
  3. Store all the history needed for analysis.
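To make the three requirements concrete, here is a toy sketch in Python. It is not how any real fast data sink is implemented; `ToyFastDataSink`, `ingest`, and `count_by` are hypothetical names invented for illustration. The point is only that a single store takes concurrent writes, answers queries that include the newest rows, and keeps all the history in the same place.

```python
import threading
from collections import Counter


class ToyFastDataSink:
    """Hypothetical single store: fresh and historical rows live in one log."""

    def __init__(self):
        self._rows = []                # requirement 3: history stays in the same store
        self._lock = threading.Lock()  # guards concurrent ingest and query

    def ingest(self, row):
        # Requirement 1: accept streamed rows as they arrive.
        with self._lock:
            self._rows.append(row)

    def count_by(self, key):
        # Requirement 2: a query sees every row ingested so far, newest included.
        with self._lock:
            return Counter(row[key] for row in self._rows)


sink = ToyFastDataSink()


def producer(region, n):
    # Simulates one incoming stream writing n rows.
    for i in range(n):
        sink.ingest({"region": region, "value": i})


threads = [threading.Thread(target=producer, args=(r, 1000)) for r in ("east", "west")]
for t in threads:
    t.start()
for t in threads:
    t.join()

counts = sink.count_by("region")
print(dict(counts))
```

A real engine replaces the list and the lock with columnar storage, indexing, and distribution, but the contract it offers to the streaming and BI layers is the same.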

We call these engines "fast data sinks" and there are four main groups of them today:

  1. In-memory or GPU databases: SAP HANA, MemSQL, and Kinetica.
  2. Search engines: Elasticsearch and Solr.
  3. Cutting-edge Hadoop: Kudu, a storage engine that runs on the Cloudera Hadoop stack.
  4. Some cloud databases: Google BigQuery, Snowflake.

Some people try to use key-value stores and document datastores such as HBase and MongoDB for this type of use case. That works at lower scales, but as soon as data and query volumes increase, they often get too slow to be useful.

Zoomdata operates together with a fast data sink to allow interactivity and visualization on near real-time data. Streaming data can be sent directly to the fast data sink, or to Zoomdata, which immediately puts it into the fast data sink.

When users visualize data in real-time, Zoomdata runs lots of tiny queries directly on the fast data sink, effectively "tailing" the data. But these queries generally include some amount of micro-aggregation, so the raw data does not need to pass through the Zoomdata engine. This allows us to leverage the power of the fast data sink, instead of processing the data multiple times or storing it in multiple places.
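As a rough illustration of "tailing" with micro-aggregation, the sketch below uses an in-memory SQLite database as a stand-in for a fast data sink. The `events` table, its columns, and the `tail_aggregate` helper are assumptions made up for this example, not Zoomdata's actual queries. Each call runs one small query that aggregates only the rows newer than the last timestamp seen, so the client receives summaries rather than raw rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real fast data sink
conn.execute("CREATE TABLE events (ts INTEGER, region TEXT, amount REAL)")


def tail_aggregate(conn, since_ts):
    """Aggregate rows newer than since_ts; return the summaries and the new high-water mark."""
    rows = conn.execute(
        "SELECT region, COUNT(*), SUM(amount), MAX(ts) "
        "FROM events WHERE ts > ? GROUP BY region ORDER BY region",
        (since_ts,),
    ).fetchall()
    # If nothing new arrived, keep the previous mark.
    new_mark = max((r[3] for r in rows), default=since_ts)
    return [(region, n, total) for region, n, total, _ in rows], new_mark


# A first batch of streamed rows lands in the sink.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "east", 10.0), (2, "west", 5.0), (3, "east", 2.5)],
)
first, mark = tail_aggregate(conn, since_ts=0)

# More rows arrive; the next tiny query only touches what is new.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(4, "west", 1.0), (5, "west", 4.0)],
)
second, mark = tail_aggregate(conn, since_ts=mark)

print(first)
print(second)
```

Because the grouping and summing happen inside the sink, the amount of data crossing the wire stays small no matter how many raw rows arrived in the interval.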

A Lambda architecture, on the other hand, keeps the real-time data separate from the historical data. This is only needed if they can’t be kept together; there is no other benefit from separating “now” from history. The theory was that some analytics need to be done on fresh data, while other, perhaps more complex, analysis doesn’t need the most recent data. In reality, however, you almost always want the freshest data. Even if you aren’t analyzing what happened in the last few seconds or minutes, you would certainly want your analysis to include any historical data that had recently been updated or corrected.

When the Lambda architecture was originally conceived, the idea was that another layer would seamlessly union the data across the “speed” and “batch” layers on behalf of the users. But that type of tool-level unioning hasn’t happened, and even if it had, it wouldn’t support some types of analysis that need the raw data from both layers, such as distinct counts or histogram/binning type operations.
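The distinct-count problem is easy to see with a few rows of made-up data. In the Python sketch below, `batch_rows` and `speed_rows` are hypothetical samples standing in for the two layers. Each layer can report its own distinct user count, but adding those two numbers double-counts anyone who appears in both, so the correct answer requires the raw rows from both layers together.

```python
# Hypothetical (user, region) rows held by each layer.
batch_rows = [("alice", "east"), ("bob", "east"), ("carol", "west")]
speed_rows = [("bob", "east"), ("dave", "west"), ("alice", "west")]


def distinct_users(rows):
    """Count unique users in a set of raw rows."""
    return len({user for user, _ in rows})


batch_distinct = distinct_users(batch_rows)   # each layer answers for itself
speed_distinct = distinct_users(speed_rows)

# Unioning the pre-computed results: alice and bob are counted twice.
naive_union = batch_distinct + speed_distinct

# Computing over the combined raw data gives the real answer.
true_distinct = distinct_users(batch_rows + speed_rows)

print(naive_union, true_distinct)  # prints "6 4"
```

Histograms with data-dependent bin edges hit the same wall: bins computed separately per layer do not line up, so they cannot simply be merged after the fact.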

So the net is that today some of the most recent databases and data systems are able to meet the three requirements listed above to be a fast data sink. They are also getting inexpensive enough to procure and deploy that they can hold lots of history as well. So there really is no longer a reason to consider a Lambda architecture to handle real-time data. It can all be done in one platform, as long as that platform can act as a fast data sink.

Topics:
big data, data sink, real-time analytics, lambda architecture, big data analysis

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
