The release of version 1.0 marks another major milestone for Apache Storm. Since becoming an Apache project in Sept 2013, much work has gone into maturing the feature set and also improving performance by reworking or tweaking various components. (See A Brief History of Apache Storm.)
Some of the notable changes that have contributed to improved performance are:
- Switch from ZeroMQ to Netty for inter-worker messaging.
- Employing batching in disruptor queues (used for intra-worker messaging).
- Optimizations in the Clojure code, such as employing type hints and reducing expensive Clojure look-ups in performance sensitive areas.
In this post, we will take a look at performance improvements in Storm since its incubation into Apache. To quantify this, we compare the performance numbers of Storm v.0.9.0.1, which was the last pre-apache release, with the most recent release—Storm v1.0.1. Storm v.0.9.0.1 has also been used as a reference point for performance comparisons against Heron.
Given the existence of recent efforts to benchmark Storm “at scale," here we shall examine performance from a different angle. We narrow the focus to some specific core areas of Storm using a collection of simple topologies. To contain the scope, we have limited the scope to Storm core (i.e. no Trident).
Each topology was given at least four mins of “warm up” execution time before taking measurements. Subsequently, after a minimum of 10 minutes, metrics were captured from the Web UI for the last 10-minute window. The captured numbers have been rounded off for readability. In all cases, ACK-ing was enabled with one ACKer bolt executor. Throughput (i.e tuples/sec) was calculated by dividing the total ACKs for the 10-minute window by 600.
Due to some backward incompatibilities (mostly namespace changes) in Storm, two versions of the topologies were written, one for each Storm version. As a general principle, we have avoided configuration tweaks to tune performance and stayed with default values. The only configuration setting we applied was to set the maximum heap size of the worker to 8GB to ensure memory.
Five-node cluster (one nimbus and four supervisor nodes) running Storm v0.9.0.1
Five-node cluster (one nimbus and four supervisor nodes) running Storm v1.0.1
Three-node Zookeeper cluster
All nodes had the following configuration:
CPU: Sockets: two sockets, six cores per socket, hyper-threaded. Model: two sockets x six cores x two hyper threads = 24). Intel Xeon CPU E5-2630 0 @ 2.30GHz.
Memory: 126 GB
Network: 10 GigE
Disk: Six disks. Each 1TB. 7200 RPM.
1. Spout Emit Speed
Here we measure how fast a single spout can emit tuples.
Topology: This is the simplest topology. It consists of a ConstSpout that repeatedly emits the string “some data” and no bolts. Spout parallelism is set to 1, so there is only one instance of the spout executing. Here we measure the number of emits per second. Latency is not relevant as there are no bolts.
Spout Emit Speed: How fast a single spout can emit tuples Storm version.
Emits Rate Measurements:
v0.9.0.1: 108 k tuples/sec
v1.0.1: 3.2 million tuples/sec
2. Messaging Speed (Intra-worker)
Measure the speed at which tuples can be transferred between a spout and a bolt running within the same worker process.
Topology: Consists of a ConstSpout that repeatedly emits the string “some data” and a DevNull bolt which ACKs every incoming tuple and discards them. The spout, bolt, and acker were given one executor each. The spout and bolt were both run within the same worker.
Intra-Worker Messaging: Speed of tuple transfer between a spout and a bolt within the same worker process Storm version.
v0.9.0.1: 87k/sec, 16 ms
v1.0.1: 233 k/sec, 3.4 ms
3. Messaging Speed (Inter-worker One)
The goal is to measure the speed at which tuples are transferred between a spout and a bolt, when both are running on two separate worker processes on the same machine.
Topology: The same topology as the one used for Intra-worker messaging speed. The spout and bolt were, however, run on two separate workers on the same host. The bolt and the acker were observed to be running on the same worker.
Inter-Worker One Messaging Speed: Spout and Bolt run on two separate works on same host, bolt and acker on the same worker Storm version.
v0.9.0.1: 48 k/sec, 170ms
v1.0.1: 287 k/sec, 8ms
4. Messaging Speed (Inter-worker Two)
The goal is to measure the speed at which tuples are transferred when the spout, bolt, and acker are all running on separate worker processes on the same machine.
Topology: The same topology as the one used for Intra-worker messaging speed. The spout, bolt and acker were however run on three separate workers on the same host.
Inter-Worker Two Messaging Speed: Tuples transferred when spout, bolt, and acker all running on separate worker process on the same machine Storm version (Links to Topology)
v0.9.0.1: 43 k/sec, 116 ms
v1.0.1: 292 k/sec, 8.6 ms
5. Messaging Speed (Inter-host One)
The goal is to measure the speed at which tuples are transferred between a spout and a bolt, when both are running on two separate worker processes running on two different machines.
Topology: The same topology as the one used for Intra-worker messaging speed but the spout and bolt were run on two separate workers on two different hosts. The bolt and the acker were observed to be running on the same worker.
Interhost Messaging Speed: Speed between a Spout and a Bolt, each on a different worker process on two different machines Storm version.
v0.9.0.1: 48 k/sec, 845 ms
v1.0.1: 316 k/sec, 13.3 ms
6. Messaging Speed (Inter-Host Two)
Here we measure the speed at which tuples are transferred when the spout, bolt, and acker are all running on separate worker processes on three different machines.
Topology: Again, the same topology as Inter-host One, but this time the acker runs on a separate host.
Inter-Host Messaging Speed (Second Scenario): Tuples transferred when the spout, bolt, and acker are all running on separate worker processes on three different machines.
v0.9.0.1: 50 k/sec, 1.7 seconds
v1.0.1: 303 k/sec, 7.4 ms
The changes mentioned at the beginning of this blog are major contributors to the improvements we see here. The impact of switching from ZeroMQ to Netty would be visible when there is more than one worker (which is often the case) while the other changes should impact almost any topology. Additionally, the introduction of the pacemaker process to optionally handle heartbeats for alleviating the load on Nimbus is worth noting. Although it doesn’t impact raw performance of individual worker, it helps scaling topologies on large clusters.
Numbers suggest that Storm has come a long way in terms of performance. But it still has room go faster. Here are some of the broad areas that should improve performance in future:
An effort to rewrite much of Storm’s Clojure code in Java is underway. Profiling has shown many hotspots in Clojure code.
Better scheduling of workers. Yahoo is experimenting with a Load Aware Scheduler for Storm to be smarter about how topologies are scheduled on the cluster.
Based on microbenchmarking and discussions with other Storm developers, there appears potential for streamlining the internal queuing for faster message transfer.
Operator coalescing (executing consecutive spouts/bolts in a single thread when possible) is another area that reduces intertask messaging and improves throughput.