One great thing about working for Hortonworks is that you get to try out new features before they are released, with real feature engineers as tour guides—features like LLAP, which stands for Live Long and Process. LLAP is coming to Hortonworks with Hive 2.0 and (spoiler alert) it looks like it will be worth the wait.
Originally a pure batch processing platform, Hive has sped up enormously over the last couple of years with Tez, the cost-based optimizer (CBA), ORC files, and a host of other improvements. Queries that once took minutes now (1st quarter 2016) run at interactive speeds, and LLAP aims to push latencies into the sub-second range.
So, what is LLAP exactly? The answer is complicated, but one thing LLAP is not is a replacement for Tez. The two work together, and Tez will continue to manage the coordination of the Hive processing DAG with or without LLAP. The illustration below (cribbed from Hortonworks) shows the relationship between MR, Tez, and LLAP. A fourth block might have also been shown in which all processing takes place within LLAP daemons.
The key thing to keep in mind is that conceptually Tez containers handle a single mapper or reducer task while the units of LLAP processing are long-lived daemons that can serve multiple tasks, both in series and in parallel. (I say "conceptually" because Tez actually can reuse containers to save startup costs, but each use of the container is new.)
LLAP processing compliments query optimization. Think of optimization like this: some of what the CBO does is purely logical, for instance, applying algebraic transformations to the query logic to produce a more efficient but equivalent sequence of operations. Other CBO optimizations add empirical information to the logic, taking into account the statistics of datasets, for example, noticing that certain tables are small enough to pull into memory for a map-side join.
At a still lower level than the CBO’s logical optimizations are optimizations to what engineers call the "physical algebra" i.e., the operators, metadata, and I/O that are used to instantiate the higher-level logical operations. It is at this physical level that LLAP does its work. The LLAP daemons are accessed using a fragment-oriented API that is geared toward handling the chunks of computation—input, output, processing operations, and metadata—that constitute the physical algebra of a query.
Among the most important and best-known of LLAP’s optimizations is keeping data live in memory so that it can be used in multiple queries. In addition to reducing disk-I/O, this saves the considerable CPU cost of decompressing and unmarshalling the data repeatedly.
There are also numerous other efficiencies including the following:
- Processing multiple fragments in a long-lived daemon saves on container restart costs and latency.
- Unlike MR or Tez processing, which typically interleave I/O operations and processing, LLAP uses a multi-threaded I/O "elevator" model, which executes data reads, decoding, and processing on separate threads in order to keep the processing threads busier. (Picture independent elevators side by side, as opposed to one shared elevator.)
- Processing multiple fragments within a single daemon means that the cost of JIT compiling is amortized over more executions of the same code (with MR or Tez, each container, i.e, VM, compiles independently.)
- In addition to sequential sharing (i.e. caching), stateless daemons can share access to data among multiple parallel threads.
- Off-heap data caching reduces GC overhead and resulting disruption.
- Metadata and indexes are also cached and shared across multiple fragments, as are hashtables for map-side joins.
Three Kinds of Speed
All items mentioned above speed up processing, but "speeding up" a system actually refers to three different properties that can be traded off against each other. The first, throughput, refers to how much data passes through the system in a given time. Throughput is a rate, e.g., GB/sec. The second, latency, refers to the average amount of time it takes for work to go end to end. Latency is not a rate, but a duration of time, e.g., X seconds to complete a job. The bulleted items above tend to improve both of these, but they are actually independent. For instance, a batch processing system might process an awesome volume of data (i.e., high throughput), yet enqueue jobs for a long time (poor latency.)
A third property, responsiveness, is more subjective. It refers to how quickly and consistently a system turns tasks around. Like latency, it concerns time, not rate, but it is about minimizing the longest wait-time for completion, rather than holding wait-time to the strictest minimum, and there is usually a tradeoff between throughput and responsiveness because a responsive system needs to have some idle resources available. Similarly, task preemption, in addition to reducing throughput, tends to increase the average latency slightly in order to hold the maximum down.
LLAP speeds up Hive in all three ways, but the goal of sub-second queries emphasizes responsiveness above of all. In service of this, LLAP’s multi-threaded daemons can accept, reject, or enqueue incoming work and process multiple fragments in parallel, which not only facilitates efficient scheduling but lowers the cost of task preemption.
Other Quirks and Properties
LLAP’s processing modes are flexible: a job can still run entirely in Tez, or the processing can be divided between Tez containers and LLAP executors, or it can run entirely within LLAP. You can turn LLAP on and off, either for the whole cluster or per-job.
Some of the advantages that are anticipated eventually will not be included in the first release, including support for ACID behavior, in-daemon access control mechanisms, and support for concurrent query execution. The first release will emphasize stability, with these extra features following.
One subtle disadvantage of sharing data in the daemons is that it makes user-defined functions (UDF’s) risky. Therefore, unfortunately, they must be disallowed for LLAP, at least in the initial release. They will remain legal under Tez, of course.
Note that LLAP (at least for now) works only with ORC format files, which is a natural pairing with LLAP’s column oriented cache. (But, if ORC isn’t already your default format for all things Hive, shame on you!)
How Does It Perform?
There is no simple answer to the question of how much speedup LLAP provides because the answer inherently depends on the workload.
For example, it is trivial to come up with a query for which LLAP will not be at its best. Any one-off computation on a large volume of data, for which Hive has to write as much as it reads, is in this category. Sorting and ETL jobs tend to be like this. Such jobs will typically be dominated by network traffic first, data writes and replication second, and reading/computation only third. The key strengths of LLAP are relatively little in play in this kind of work because the jobs are bottlenecked on the network and on disk writes, not on data reading, and especially not on repeatedly reading the same data. Some LLAP features, such as the multi-threaded reads, still help, but it’s not LLAP’s sweet spot.
On the other hand, even on monster clusters, the vast majority of jobs deal with datasets of from gigabytes to a few terabytes, and even a small cluster will have from five to twenty TB of memory, and often much more. Moreover, in enterprise computing, using the same tables repeatedly is the common case, not the exception, and business queries tend to be heavy on operations like aggregation and extracting a small amount of data out of a large data set. Thus, the most typical business use-case is right up LLAP’s alley.
There is an infinite variety of queries and datasets, but in general, what we see with LLAP in its current state is:
- A speedup factor of from 7x to 10x on smaller queries, aggregations, and queries that select a small slice of data, even if the logic is complex.
- A factor of 2x or 3x for large queries.
- A factor of 1.5 to 2.0 for queries dominated by large output.
LLAP is friendly to concurrent loads, especially if the processing involves overlapping datasets.
Interestingly, the size of the increase for queries with large output, though relatively small, is not as small as one might guess, given that caching is ineffective. Other optimizations are in play, however, for example, multi-threading and other optimizations allow for more continuous streaming. Increased streaming reduces the number of time-consuming seeks, and fewer seeks means disks can stream more data in a given interval of time.
It is sometimes overlooked that you can also express an efficiency increase the other way around. LLAP speeds up queries, but you can say just as correctly that LLAP allows you to get the same amount of work out of a smaller cluster. Or, you can express it a third way, that you can serve more jobs/users from the same size cluster. It’s a question for psychologists why people generally express efficiency increases only in terms of running faster!
The State of LLAP Today
While LLAP still has a way to go before it is GA, it has been tested extensively on production scale clusters with realistic workloads. The fundamentals are clearly there. Much of the ongoing work seems to be in polishing up deployment, stability, etc., as well as honing features relating to security, data sharing, and the like, which are particularly important because of the ability to share data across queries.
I recently had a chance to personally participate in tests on a 24 node cluster using queries and data that have been extensively used in the past with Hive over plain Tez, and saw results very much in keeping with those cited above—if anything, at the high end, with the majority of queries speeding up by roughly 10X, and with many moving from the finger-drumming time scale to the sub-second scale. Results were also excellent for multiple concurrent users applying heavy loads.
The first version of LLAP will ship with Hive 2.0, which will be part of the Hortonworks 3.0 release around the end of 2016 or early 2017.