Why Mutability Is Essential for Real-Time Data Analytics

Readers will learn how mutability enables updates to existing records in a data store and why it is essential for real-time analytics to better infrastructure.

Dhruba Borthakur

Sep. 15, 22 · Analysis

Likes (1)

Comment

Save

8.4K Views

Successful data-driven companies like Uber, Facebook, and Amazon rely on real-time analytics. Personalizing customer experiences for e-commerce, managing fleets and supply chains, and automating internal operations require instant insights into the freshest data.

To deliver real-time analytics, companies need a modern technology infrastructure that includes three things:

A real-time data source such as web clickstreams, IoT events produced by sensors, etc.
A platform such as Apache Kafka/Confluent, Spark, or Amazon Kinesis for publishing that stream of event data.
A real-time analytics database capable of continuously ingesting large volumes of real-time events and returning query results within milliseconds.

Event streaming/stream processing has been around for almost a decade. It’s well understood. Real-time analytics is not. One of the technical requirements for a real-time analytics database is mutability. Mutability is the superpower that enables updates, or mutations, to existing records in your data store.

Differences Between Mutable and Immutable Data

Before we talk about why mutability is key to real-time analytics, it’s important to understand what it is.

Mutable data is data stored in a table record that can be erased or updated with newer data. For instance, in a database of employee addresses, let’s say that each record has the name of the person and their current residential address. The current address information would be overwritten if the employee moves residences from one place to another.

Traditionally, this information would be stored in transactional databases — Oracle Database, MySQL, PostgreSQL, etc. — because they allow for mutability: Any field stored in these transactional databases is updatable. For today’s real-time analytics, there are many additional reasons why we need mutability, including data enrichment and backfilling data.

Immutable data is the opposite — it cannot be deleted or modified. Rather than writing over existing records, updates are append-only. Meaning updates are inserted into a different location, or you’re forced to rewrite old and new data to store it properly. More on the downsides of this later. Immutable data stores have been useful in certain analytics scenarios.

The Historic Usefulness of Immutability

Data warehouses popularized immutability because it eased scalability, especially in a distributed system. Analytical queries could be accelerated by caching heavily-accessed read-only data in RAM or SSDs. If the cached data was mutable and potentially changing, it would have to be continuously checked against the original source to avoid becoming stale or erroneous. This would have added to the operational complexity of the data warehouse; immutable data, on the other hand, created no such headaches.

Immutability also reduces the risk of accidental data deletion, a significant benefit in certain use cases. Take health care and patient health records. Something like a new medical prescription would be added rather than written over existing or expired prescriptions so that you always have a complete medical history.

More recently, companies tried to pair stream publishing systems such as Kafka and Kinesis with immutable data warehouses for analytics. The event systems captured IoT and web events and stored them as log files. These streaming log systems are difficult to query, so one would typically send all the data from a log to an immutable data system such as Apache Druid to perform batch analytics.

The data warehouse would append newly-streamed events to existing tables. Since past events, in theory, do not change, storing data immutably seemed to be the right technical decision. While an immutable data warehouse could only write data sequentially, it did support random data reads. That enabled analytical business applications to efficiently query data whenever and wherever it was stored.

The Problems With Immutable Data

Of course, users soon discovered that, for many reasons, data does need to be updated. This is especially true for event streams because multiple events can reflect the true state of a real-life object. Also, network problems or software crashes can cause data to be delivered late. Late-arriving events need to be reloaded or backfilled.

Companies also began to embrace data enrichment, where relevant data is added to existing tables. Finally, companies started having to delete customer data to fulfill consumer privacy regulations such as GDPR and its “right to be forgotten.”

Immutable database makers were forced to create workarounds to insert updates. One popular method used by Apache Druid and others is called copy-on-write. Data warehouses typically load data into a staging area before it is ingested in batches into the data warehouse, where it is stored, indexed, and made ready for queries. If any events arrive late, the data warehouse will have to write the new data and rewrite already-written adjacent data to store everything correctly in the right order.

Another poor solution to deal with updates in an immutable data system is to keep the original data in Partition A (above) and write late-arriving data to a different location, Partition B. The application, and not the data system, will have to keep track of where all linked-but-scattered records are stored, as well as any resulting dependencies. This process is called referential integrity and has to be implemented by the application software.

Both workarounds have significant problems. Copy-on-write requires data warehouses to expend a lot of processing power and time — tolerable when updates are few, but intolerably costly and slow as the number of updates rises. That creates significant data latency that can rule out real-time analytics. Data engineers must also manually supervise copy-on-writes to ensure all the old and new data is written and indexed accurately.

An application implementing referential integrity has its own issues. Queries must be double-checked to ensure they are pulling data from the correct locations or run the risk of data errors. Attempting any query optimizations, such as caching data, also becomes much more complicated when updates to the same record are scattered in multiple places in the data system. While these may have been tolerable at slower-paced batch analytic systems, they are huge problems when it comes to mission-critical real-time analytics.

Mutability Aids Machine Learning

At Facebook, we built an ML model that scanned all-new calendar events as they were created and stored them in the event database. Then, in real-time, an ML algorithm would inspect this event and decide whether it is spam. If it is categorized as spam, then the ML model code would insert a new field into that existing event record to mark it as spam. Because so many events were flagged and immediately taken down, the data had to be mutable for efficiency and speed. Many modern ML-serving systems have emulated our example and chosen mutable databases.

This level of performance would have been impossible with immutable data. A database using copy-on-write would quickly get bogged down by the number of flagged events it would have to update. If the database stored the original events in Partition A and appended flagged events to Partition B, this would require additional query logic and processing power, as every query would have to merge relevant records from both partitions. Both workarounds would have created an intolerable delay for our Facebook users, heightened the risk of data errors, and created more work for developers and/or data engineers.

To sum up, mutability is key for today’s real-time analytics because event streams can be incomplete or out of order. When that happens, a database will need to correct and backfill missing and erroneous data. To ensure high performance, low cost, error-free queries, and developer efficiency, your database must support mutability.

If you want to see all of the key requirements of real-time analytics databases, watch my recent talk at the Hive on Designing the Next Generation of Data Systems for Real-Time Analytics, available below.

Analytics Data warehouse Database Data (computing) Event kafka systems

Published at DZone with permission of Dhruba Borthakur. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

Trending