Book Review: Designing Data-Intensive Applications (Part 3)
Go in-depth into a book review about designing data-intensive applications.
Join the DZone community and get the full member experience.Join For Free
10. Batch Processing
This is the first chapter of the part of the book dealing with derived data. There is a distinction between systems of record (holds the authoritative version of the data) and derived data systems. Data in derived data systems is existing data transformed or processed in some way. For example, a cache or a search index.
The chapter starts with an example of batch processing using Unix tools. To find the five most popular URLs from an access log, the commands sort, awk, uniq, and head are piped together. Sort is actually much more powerful than I thought. If the dataset does not fit in memory, it will automatically store to disk, and it will automatically parallelize sorting across multiple CPU cores if available.
There are similarities between how MapReduce works and how the piped-together Unix tools work. They do not modify the input, they do not have any side effects other than producing the output, and the files are written once in a sequential fashion. For the Unix tools, stdin and stdout are the input and output.
MapReduce jobs read and write files on a distributed file system. Since the input is immutable, and there are no side effects, failed MapReduce jobs can just be run again. Depending on what results we want from the MapReduce jobs, there are some different kinds of joins that can be performed: sort-merge joins, broadcast hash joins, and partitioned hash joins.
Using the analogy of the piped Unix commands, MapReduce is like writing the output of each command to a temporary file. There are newer dataflow engines, like Flink, that can improve the performance over classic MapReduce. For example, by not storing to file (materializing) as often, and by only sorting when needed, as opposed to at every stage. When sorting is avoided at some steps, you also don't need the whole data set, and you can pipeline the stages.
11. Stream Processing
An event is a small, self-contained, immutable object containing the details of something that happened at some point in time. Events can, for example, be generated by users taking actions on a web page, temperature measurements from sensors, server metrics like CPU utilization, or stock prices. Stream processing is similar to batch processing but is done continuously on unbounded streams rather than on fixed-size input. In this analogy, message brokers are the streaming equivalents of a filesystem.
There are two broad categories of message brokers depending on whether they discard or keep the messages after they have been processed. Log-based message brokers (like Kafka) keep the messages, so it's possible to go back and reread old messages. This is similar to replication logs in databases and log-structured storage engines.
It can also be useful to think of writes to a database as a stream. Log compaction can reduce the storage needed while still allowing the stream to retain a full copy of the database. Representing the database as a stream allows for derived data systems such as search indexes, caches, and analytics systems to be continually up to date. This is done by consuming the log of changes and applying them to the derived system. It is also possible to create new views by starting from the beginning and consume all the events up to the present. This is also very similar to event sourcing.
Typically, there is a timestamp in every event. This timestamp is different from the time the server processes the event, and this can lead to some strange situations. For example, a user makes a web request, which is handled by web server A. Then, the user makes another request handled by web server B. Both web servers emit events, but B's event gets to the message broker first (maybe due to queueing or network faults). So the message broker sees the event from B and then the event from A even though they occurred in the opposite order.
You can also perform analytics on streams. For example, measuring the rate of something, calculating a rolling average over some time period, or comparing current statistics to previous time intervals to detect trends. Various types of windows can be used: tumbling, hopping, sliding or session. Also, just like for batch jobs, you can join stream data with database tables to enrich the event data.
12. The Future of Data Systems
In this chapter, Kleppmann describes his vision for how data systems should be designed. It is based on the ideas from chapter 11, using event streams from systems of record to create various derived views of the data. Since the derivations are asynchronous and loosely coupled, problems in one area aren't spreading to other unrelated areas like they do in tightly integrated systems. Furthermore, these types of systems can better handle mistakes. If the code processing the data has a bug, that bug can be fixed, and then the data can be reprocessed.
There is also a discussion on how internal measures, such as transactions, are not enough to protect from for example erroneously perform an operation twice. Checks need to work end-to-end from the application. For example, making sure an operation is idempotent could be done by assigning a unique identifier to it and checking that the operation is only done once for that id.
Sometimes, it is also better to be able to compensate when something goes wrong instead of putting a lot of effort into preventing it. For example, a compensating transaction, if an account has been overdrawn, or an apology and compensation if a flight has been overbooked. If it doesn't happen too often, it is acceptable for most businesses. By checking constraints asynchronously, you can avoid most coordination and still maintain integrity while also performing well.
Tied to dataflows is a discussion of moving away from request/response systems to publish/subscribe dataflows. If you are notified of all the changes, you can keep the view up to date (compare to how spreadsheets work, where changes ripple through cells). It is, however, hard to do this because assumptions of request/response are deeply ingrained in databases, libraries, and frameworks.
The last section of the chapters deals with ethics considerations when developing data handling systems. One interesting thought experiment is to replace the word data with surveillance. "In our surveillance-driven organization, we collect real-time surveillance streams and store them in our surveillance warehouse. Our surveillance scientists use advanced analytics and surveillance processing in order to derive new insights."
Throughout the book, there were lots of nuggets of information that I found really interesting. Here are a few of my favorites.
- In-memory databases are not faster than ones with disk-based storage because they can read from memory, and the traditional ones read from disk. The operating system caches recently used disk blocks in memory anyway. Instead, the speed advantage comes from not having to encode the in-memory data structures to a format suitable for writing to disk (page 89).
- The built-in hash functions in some languages are not suitable for getting partitioning keys, because the same key may have different hash value in different processes. For example Object.hashCode() in Java and Object#hash in Ruby (page 203).
- At Google, a MapReduce task that runs for an hour has a 5% risk of being terminated. This rate is more than an order of magnitude higher than the rate of failure due to hardware issues, machine reboots, etc. The reason MapReduce is designed to tolerate frequent unexpected task termination is not because the hardware is particularly unreliable, it's because the freedom to arbitrarily terminate processes enables better resource utilization in a computing cluster (page 418).
Every chapter starts with a quote. Two of them I particularly like. The first, from chapter 5, is one of my all-time favorite quotes on software development:
A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. - John Gall, Systemantics (1975)
Here is the second quote, from chapter 11:
The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair. - Douglas Adams, Mostly Harmless (1992)
These days, it feels like most systems are distributed systems in one way or another. Designing Data-Intensive Applications should almost be mandatory reading for all software developers. So many of the concepts explained in it are really useful to know.
A lot of the problems described and solved in the book come down to concurrency issues. Often, there are good pictures and diagrams illustrating the points. At the beginning of each chapter, there is a fantasy-style map that lists the key concepts in the coming chapter. I quite liked those.
Designing Data-Intensive Applications is thick — a bit over 550 pages. This made me hesitate to start it — it almost felt too imposing. Luckily, we picked it for the book club at work this spring. That gave me enough of a nudge to get started and to keep going. I am really happy I started because there is so much good information in it. I particularly like how it is both theoretical and practical at the same time.
If you liked this summary, you should definitely read the whole book. There are so many more details and examples, and they are all very interesting. Highly recommended!
Opinions expressed by DZone contributors are their own.