What Is Lambda Architecture? Ultimate Guide to Getting Started

Discover the basics of lambda architecture in our comprehensive guide. Explore its components, workflow, applications, and best practices.

By Michael Bogan · Updated Oct. 01, 24 · Analysis

The volume of data generated today is immense, coming from sources like the Internet of Things (IoT), social media, app metrics, and device analytics. The challenge for data engineering is processing and analyzing this vast amount of information in real time. This raises a crucial question: Are our current data systems capable of meeting this demand?

How Does Lambda Architecture Work? 

Lambda architecture is a data engineering paradigm, a way of structuring a data system to overcome both the latency problem and the accuracy problem. It was popularized by Nathan Marz and James Warren in their 2015 book. Lambda architecture structures the system into three layers: a batch layer, a speed layer, and a serving layer, as shown in the figure below.

[Figure: Lambda architecture]

Lambda architecture is horizontally scalable. This means that if your data set becomes too large or the data views you need become too numerous, all you need to do is add more machines. It also confines the most complex part of the system to the speed layer, where the outputs are temporary and can be discarded (every few hours) if there is ever a need for refinement or correction.

The Three Layers of Lambda Architecture Explained

An elderly man living alone in a large mansion decides to set all his clocks to match the correct time on his kitchen clock, 9:04 AM. Due to his slow pace, by the time he reaches the last clock at 9:51 AM, he still sets it to 9:04 AM, making all his clocks incorrect. This is the problem we would experience if a data system only had a batch layer. The next day, he uses a stopwatch — his speed layer — to track his progress. Starting at 9:04 AM, he notes the elapsed time from the stopwatch when setting each clock, ensuring the last clock is accurately set to 9:51 AM.

In this analogy (which, admittedly, is not perfect), the man is the serving layer. He takes the batch view (the paper with “9:04 AM” written on it) around his mansion to answer “What time is it?” But he also does the additional work of reconciling the batch view with the speed layer to get the most accurate answer possible.
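
The analogy can be written out as a few lines of arithmetic. This is only an illustrative sketch: the calendar date is an arbitrary placeholder, and the function name `current_time` is made up for this example.

```python
from datetime import datetime, timedelta

# Batch view: the time copied from the kitchen clock when the walk began.
# The calendar date is an arbitrary placeholder; only the time of day matters.
batch_view = datetime(2024, 1, 1, 9, 4)

# Speed layer: the elapsed time shown on the stopwatch at the last clock.
stopwatch_elapsed = timedelta(minutes=47)


def current_time(batch_value, elapsed):
    """Serving-layer role: reconcile the stale batch view with the recent delta."""
    return batch_value + elapsed


result = current_time(batch_view, stopwatch_elapsed)
print(result.time())  # 09:51:00
```

Without the stopwatch, the only available answer is the stale batch view, 9:04 AM.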

Understanding the Batch Layer 

As mentioned above, lambda architecture is made up of three layers. The first layer, the batch layer, stores the entire data set and computes batch views. The stored data set is immutable and append-only: new data is continually streamed in and appended, but existing data never changes. Batch views are the results of queries or functions precomputed over the entire data set. These views can then be queried for low-latency answers to questions about the entire data set. The drawback, however, is that computing these batch views takes a lot of time.
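
As a rough sketch of these two responsibilities, the Python below models an append-only data set and a batch view recomputed from scratch. The page-view event and the names `Event`, `MasterDataset`, and `compute_batch_view` are assumptions made for illustration; a real batch layer would run on a distributed system such as Hadoop rather than in memory.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a stored event can never be modified
class Event:
    url: str
    timestamp: int  # seconds since epoch (hypothetical field)


class MasterDataset:
    """Append-only store: events can be added but never updated or removed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def scan(self):
        # Return a tuple so callers cannot mutate the stored history.
        return tuple(self._events)


def compute_batch_view(dataset):
    """Recompute page-view counts from scratch over the entire data set."""
    return Counter(event.url for event in dataset.scan())


dataset = MasterDataset()
dataset.append(Event("/home", 100))
dataset.append(Event("/pricing", 105))
dataset.append(Event("/home", 110))

batch_view = compute_batch_view(dataset)
print(batch_view["/home"])  # 2
```

Because the view is always derived from the full history, it can be thrown away and rebuilt at any time without losing information.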

Understanding the Serving Layer 

The second layer in lambda architecture is the serving layer. The serving layer loads in the batch views and, much like a traditional database, allows for read-only querying on those batch views, providing low-latency responses. As soon as the batch layer has a new set of batch views ready, the serving layer swaps out the now-obsolete set of batch views for the current set.
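
The swap described above might look like the following sketch. The `ServingLayer` class and its method names are hypothetical; the point is only that readers see either the old set of views or the new one, never a mixture.

```python
import threading


class ServingLayer:
    """Read-only query interface over the most recently loaded batch views."""

    def __init__(self):
        self._views = {}
        self._lock = threading.Lock()

    def load_batch_views(self, new_views):
        # Copy first, then swap the reference in one step so that readers
        # never observe a half-loaded set of views.
        snapshot = dict(new_views)
        with self._lock:
            self._views = snapshot

    def query(self, key, default=0):
        with self._lock:
            views = self._views
        return views.get(key, default)


serving = ServingLayer()
serving.load_batch_views({"/home": 2, "/pricing": 1})
first = serving.query("/home")
print(first)  # 2

# A newer batch run finishes; the obsolete views are replaced wholesale.
serving.load_batch_views({"/home": 5, "/pricing": 3, "/docs": 1})
second = serving.query("/home")
print(second)  # 5
```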

Understanding the Speed Layer 

The third layer is the speed layer. The data that streams into the batch layer also streams into the speed layer. The difference is that while the batch layer keeps all of the data since the beginning of time, the speed layer only cares about the data that has arrived since the last set of batch views was completed. The speed layer makes up for the high latency of computing batch views by processing queries on the most recent data that the batch views have yet to take into account.
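
A minimal sketch of such a layer is shown below. It is a simplification under stated assumptions: the `SpeedLayer` class is hypothetical, counts are grouped into one-minute buckets, and the batch cutoff is assumed to fall on a minute boundary. Only running counts are kept, not the raw events.

```python
from collections import Counter, defaultdict


class SpeedLayer:
    """Incremental page-view counts for events not yet covered by a batch view."""

    BUCKET_SECONDS = 60  # assumption: one-minute buckets

    def __init__(self):
        self._counts = defaultdict(Counter)  # bucket number -> url counts

    def on_event(self, url, timestamp):
        # Update the view incrementally; the raw event is not retained.
        self._counts[timestamp // self.BUCKET_SECONDS][url] += 1

    def query(self, url):
        return sum(counts[url] for counts in self._counts.values())

    def expire_before(self, batch_cutoff):
        # Drop buckets that the newest batch view now covers.
        # Assumes batch_cutoff is aligned to a bucket boundary.
        cutoff_bucket = batch_cutoff // self.BUCKET_SECONDS
        for bucket in [b for b in self._counts if b < cutoff_bucket]:
            del self._counts[bucket]


speed = SpeedLayer()
speed.on_event("/home", 120)
speed.on_event("/home", 130)
speed.on_event("/docs", 185)
recent_home = speed.query("/home")
print(recent_home)  # 2

# A batch run covering everything before t=180 completes.
speed.expire_before(180)
print(speed.query("/home"))  # 0
print(speed.query("/docs"))  # 1
```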

Advantages of Lambda Architecture 

In Marz and Warren’s seminal book explaining lambda architecture, among other topics, they list eight desirable properties within a big data system, describing how lambda architecture satisfies the following requirements:

  1. Robustness and fault tolerance: Because the batch layer is designed to be append-only, thus containing the entire data set since the beginning of time, the system is human-fault tolerant. If there is any data corruption, all of the data from the point of corruption forward can be deleted and replaced with the correct data. Batch views can be swapped out for completely recomputed ones, and the speed layer can be discarded. In the time it takes to generate a new set of batch views, the entire system can be reset and running again.
  2. Scalability: Lambda architecture is designed with layers built as distributed systems. By simply adding more machines, end users can easily horizontally scale those systems. 
  3. Generalization: Since lambda architecture is a general paradigm, adopters aren’t locked into a specific way of computing this or that batch view. Batch views and speed layer computations can be designed to meet the specific needs of the data system.
  4. Extensibility: As new types of data enter the data system, new views will become necessary. Data systems are not locked into certain kinds or a certain number of batch views. New views can be coded and added to the system, with the only constraint being resources, which are easily scalable. 
  5. Ad hoc queries: If necessary, the batch layer can support ad hoc queries that were not available in the batch views. Assuming the high latency of these ad hoc queries is acceptable, the batch layer’s usefulness is not restricted to the batch views it generates.
  6. Minimal maintenance: Lambda architecture, in its typical incarnation, uses Apache Hadoop for the batch layer and ElephantDB for the serving layer. Both are fairly simple to maintain.
  7. Debuggability: The inputs to the batch layer’s computation of batch views are always the same: the entire data set. In contrast to debugging views computed on a snapshot of a stream of data, the inputs and outputs for each layer in lambda architecture are not moving targets, vastly simplifying the debugging of computations and queries.
  8. Low latency reads and updates: In lambda architecture, the last property of a big data system is fulfilled by the speed layer, which offers real-time queries of the latest data set.
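
The first property, human-fault tolerance, can be sketched as follows. The records and the misspelled URL are invented for illustration; the idea is simply that the bad records are dropped from the point of corruption forward, correct ones are appended, and the batch view is recomputed from the repaired data set.

```python
from collections import Counter

# Hypothetical master data set: (sequence_number, url) records in arrival order.
# Records 3 and 4 were written by a buggy deploy that misspelled the URL.
master = [(1, "/home"), (2, "/pricing"), (3, "/hme"), (4, "/hme")]


def batch_view(records):
    """Deterministic function of the whole data set: same input, same view."""
    return Counter(url for _, url in records)


# Drop everything from the point of corruption forward, then append the
# corrected records.
corruption_start = 3
repaired = [record for record in master if record[0] < corruption_start]
repaired += [(3, "/home"), (4, "/home")]

view = batch_view(repaired)
print(view["/home"])  # 3
print(view["/hme"])   # 0
```

Because `batch_view` depends only on its input, running it twice on the same records yields the same result, which is also what makes the computation straightforward to debug.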

Challenges of Lambda Architecture and Key Considerations 

While the advantages of lambda architecture seem numerous and straightforward, there are some disadvantages to keep in mind. First and foremost, cost will become a consideration. Scaling itself is not very complex (just add more machines), but because all data is append-only and nothing in the batch layer is discarded, the batch layer must keep expanding, and the cost of running it grows with time.

Others have noted the challenge of maintaining two separate sets of code to compute views for the batch layer and the speed layer. Both layers operate on the same set—or, in the case of the speed layer, subset—of data, and the questions asked of both layers are similar. However, because the two layers are built on completely different systems (for example, Hadoop or Snowflake for the batch layer, but Storm or Spark for the speed layer), code maintenance for two separate systems can be complicated.

Stream Processing and Lambda Architecture Challenges 

We noted above that “the speed layer only cares about the data that has arrived since the completion of the last set of batch views.” While this is true, it’s important to clarify that this small subset of data is not stored; it is processed immediately as it streams in and then discarded. The speed layer is also referred to as the “stream-processing layer.” Remember that the goal of the speed layer is to provide low-latency, real-time views of the most recent data, the data that the batch views have yet to take into account.

On this point, the original authors of the lambda architecture refer to “eventual accuracy,” noting that the batch layer strives for exact computation while the speed layer strives for approximate computation. The approximate computation of the speed layer will eventually be replaced by the next set of batch views, moving the system towards “eventual accuracy.”
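
The following sketch illustrates “eventual accuracy” with made-up numbers. The function name `answer` and the counts are hypothetical; the approximation error shown (one missed event) is simply assumed for the example.

```python
def answer(batch_view, realtime_view, key):
    """Merge the exact batch count with the approximate recent count."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


# Hypothetical numbers: 1,000 events are covered by the last batch run and
# 42 have arrived since, but the speed layer's approximation saw only 41.
batch_view = {"/home": 1000}
realtime_view = {"/home": 41}
before = answer(batch_view, realtime_view, "/home")
print(before)  # 1041

# The next batch run recomputes over all 1,042 events, and the speed layer's
# state for that period is thrown away.
batch_view = {"/home": 1042}
realtime_view = {}
after = answer(batch_view, realtime_view, "/home")
print(after)  # 1042
```

The approximate answer is only ever served for the window between two batch runs; after that, the exact batch result takes its place.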

Processing a stream in real time to produce views that are updated within milliseconds of new data arriving is an incredibly complex task. Pairing a document-based database with an indexing and querying system is often recommended in these cases.

Differences Between Lambda Architecture and Kappa Architecture 

We noted above that a considerable disadvantage of lambda architecture is maintaining two separate codebases to handle similar processing since the batch layer and the speed layer are different distributed systems. Kappa architecture seeks to address this concern by removing the batch layer altogether.

Instead, both the real-time views computed from recent data and the batch views computed from all data are produced within a single stream processing layer. The entire data set (the append-only log of immutable data) streams through the system quickly in order to produce the views with the exact computations. Meanwhile, the original “speed layer” tasks from lambda architecture are retained in kappa architecture, still providing the low-latency views with approximate computations.

This difference in kappa architecture allows for the maintenance of a single system for generating views, which simplifies the system’s codebase considerably.
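
A toy sketch of that single code path is shown below. The in-memory list stands in for a durable, replayable log, and the `process` function is a hypothetical name; the point is that replaying history and handling live events use the same logic.

```python
from collections import Counter


def process(view, event):
    """Single piece of view logic, used for both replay and live traffic."""
    view[event["url"]] += 1
    return view


# Append-only log of immutable events (stand-in for a durable, replayable log).
log = [{"url": "/home"}, {"url": "/docs"}, {"url": "/home"}]

# Rebuilding a view means replaying the whole log through the same processor.
view = Counter()
for event in log:
    process(view, event)

# Live events continue through the identical code path.
new_event = {"url": "/home"}
log.append(new_event)
process(view, new_event)

print(view["/home"])  # 3
```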

Lambda Architecture via Containers on Heroku 

Coordinating and deploying the various tools needed to support a lambda architecture, especially when you’re starting up and experimenting, is made easier with Docker.

Heroku serves well as a container-based cloud platform-as-a-service (PaaS), allowing you to deploy and scale your applications with ease. For the batch layer, you would likely deploy a Docker container for Apache Hadoop. As the speed layer, you might consider deploying Apache Storm or Apache Spark. Lastly, for the serving layer, you could deploy Docker containers for Apache Cassandra or MongoDB, coupled with indexing and querying by Elasticsearch.

Use Cases and Applications 

Lambda architecture benefits a wide variety of data-driven fields. Two examples include improving machine learning and managing the data streams from IoT.

Lambda Architecture in Machine Learning 

In the field of machine learning, there’s no doubt that more data is better. For machine learning to apply algorithms or detect patterns, however, it needs to receive its data in a way that makes sense. Rather than receiving data from different directions without any semblance of structure, machine learning can benefit by processing data through a lambda architecture data system first. From there, machine learning algorithms can ask questions and begin to make sense of the data that enters the system.

Lambda Architecture for IoT 

While machine learning might be on the output side of a lambda architecture, IoT might very well be on the input side of the data system. Imagine a city of millions of automobiles, each one equipped with sensors to send data on weather, air quality, traffic, location information, driving habits, and so on. This is the massive stream of data that would be fed into the batch layer and speed layer of a lambda architecture. IoT devices are therefore a natural example of a data source for a lambda architecture.
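
As a rough illustration of that input side, the sketch below sends each sensor reading to both layers. The `SensorReading` fields, the `ingest` function, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SensorReading:
    vehicle_id: str
    metric: str   # e.g. "air_quality" (hypothetical metric name)
    value: float


master_dataset = []      # batch layer input: keeps every reading
latest_by_vehicle = {}   # speed layer state: most recent value only


def ingest(reading):
    """Fan each incoming reading out to both the batch and speed layers."""
    master_dataset.append(reading)
    latest_by_vehicle[(reading.vehicle_id, reading.metric)] = reading.value


ingest(SensorReading("car-1", "air_quality", 41.0))
ingest(SensorReading("car-1", "air_quality", 43.5))
ingest(SensorReading("car-2", "air_quality", 39.0))

print(len(master_dataset))                          # 3
print(latest_by_vehicle[("car-1", "air_quality")])  # 43.5
```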

Conclusion 

Taking on the task of big data is not for the faint of heart. Scaling and system robustness are huge challenges, so paradigms like lambda architecture bring excellent guidance. As massive amounts of data stream into the data system, the batch layer provides high-latency accuracy, while the speed layer provides a low-latency approximation. Meanwhile, the serving layer responds to queries by reconciling these two views to provide the best possible response. Implementing a lambda architecture data system is not trivial. While the task is complex and perhaps initially intimidating, the tools are available and ready to be deployed.

For additional reading, check out:

  • What Is a Data Pipeline? 
  • An Overview of Data Pipeline Architecture