[This article appears in the DZone Guide to Performance & Monitoring – 2015 Edition. For additional information including insight from industry experts and luminaries, performance statistics and strategies, and an overview of how modern companies are handling application monitoring, download the guide below.]
I have spent the last few years of my career working in the field of high-performance, low-latency systems. Two observations struck me when I returned to working on the metal:
Many of the assumptions that programmers make outside this esoteric domain are simply wrong.
Lessons learned from high-performance, low-latency systems are applicable to many other problem domains.
Some of these assumptions are pretty basic. For example, which is faster: storing to disk or writing to a cluster pair on your network? If, like me, you survived programming in the late 20th century, you know that the network is slower than disk, but that’s actually wrong! It turns out that it’s much more efficient to use clustering as a backup mechanism than to save everything to disk.
Other assumptions are more general and more damaging. A common meme in our industry is “avoid pre-optimization.” I used to preach this myself, and now I’m very sorry thatI did. These are just a few supposedly obvious principles that truly high-performance systems call into question. For many developers, these rules of thumb appear reasonably inviolable in everyday practice. But as performance demands grow increasingly strict, it becomes proportionally important for developers to understand exactly how systems work—at both the abstract, procedural level, and the level of the metal itself.
MOTIVATION: THE REPERCUSSIONS OF INEFFICIENT SOFTWARE
First, let’s think about why it’s important to transcend crude oversimplifications like “disk I/O is faster than network I/O.”
One motivation is a bit negative. I can think of noother field of human endeavor that tolerates thelevels of inefficiency that are normal in our industry.My experience has been that for most systems, any performance specialist can improve its performance ten-fold quite easily. This is because most applications are hundreds, thousands, even tens of thousands of times less efficient than they could be. Modern hardware is phenomenally capable, but we software folks regularly underestimate the strides hardware manufacturers have made. As a result, we often fail to take advantage of new hardware capabilities and miss significant shifts in the locations of common performance bottlenecks.
You may say this doesn’t matter—after all, what is all that hardware performance for, if not to make it easier to write software? And then I’ll reply: it’s for lots of things. Here’s one very concrete reason. The energy consumption from data centers constitutes a significant fraction of the CO2 being pumped into our atmosphere . If most software is more than 10x less efficient than it could be, then that could mean 10x more CO2. It also means 10x the capital cost in hardware required to run all that inefficient software.
“You don’t have to be an engineer to be be a racing driver, but you do have to have Mechanical Sympathy.” – Jackie Stewart, racing driver
Perhaps more important is that performance is about more than just efficiency for its own sake. Performanceis also an important enabler of innovation. The ability to pack the data representing 1000 songs onto the tiny hard disk in the first generation iPod made it possible for Apple to revolutionize the music industry. For an even more basic example, graphical user interfaces only became feasible when the hardware was fast enough to “waste” all that time drawing pretty pictures. High-performance hardware enabled these innovations; the software simply had to follow.
Then there’s the opportunity cost. What could we be doing with all the spare capacity of our astonishingly fast hardware and enormously vast storage if we weren’t wasting it on inefficient software?
In the problem domains where I’ve worked, the idea of mechanical sympathy has helped enormously. Let’s examine this idea a little closer.
KEY CONCEPT: MECHANICAL SYMPATHY
The term Mechanical Sympathy was coined by racing driver Jackie Stewart and applied to software by Martin Thompson. Jackie Stewart said, “You don’t have to be an engineer to be be a racing driver, but you do have to have Mechanical Sympathy.” He meant that understanding how a car works makes you a better driver. This is just as true for writing code. You don’t need to be a hardware engineer, but you do need to understand how the hardware works and take that into consideration when you design software.
Let’s take something simple like writing a file to disk. Disks are random access, right? Well, not really. Disks work by encoding data in sectors and addressing them in segments of the disk. When you read or write data to disk, youneed to wait for the heads to move to the correct physical location. If you do this randomly, then you will incur performance penalties as the heads are physically moved across the surface of the disk and you wait for the spin of the disk to place the sector you want beneath the read/ write heads. This averages out to about 3ms per seek for the heads, and about 3ms rotational latency. That makes for an average total of about 6ms per seek!
This seek time is dramatically slow when compared tothe electronic. Electrons beat spinning rust every time! And there are even more specific reasons why “random access” is a poor rule of thumb. In fact, modern disks are optimized to stream data so that they can play movies and audio. If you understand that, and treat your file storage as a serial, block device rather than random access, you will get dramatically higher performance. In this example, the difference can be two orders of magnitude (200MB/s is a good working figure for the upper bound).
REACHING HIGH PERFORMANCE: DO EXPERIMENTS, DO ARITHMETIC
Abstraction is important. It is essential that we are, to some degree, insulated from the complexity of the devices and systems that form the platform upon which our software executes. However, many of the abstractions that we take for granted are very poor and leaky. The overall system complexity gets amplified when we build abstraction on top of abstraction to hide the problems caused by the leakiness, resulting in the dramatic performance differences we see between high-performance and “normal” systems.
You don’t need to model or measure your entire systemto tackle its complexity. The starting point is to do some experimenting to understand your theoretical maximums. You should have a rough model of the following:
How much data you can write to disk in one second
How many messages your code can process in one second
How much data you can send or receive across your network in one second
A common meme in our industry is, “avoid pre- optimization.” I used to preach this myself. I am now very sorry that I did.
The idea here is to measure your actual performance against the theoretically possible performance of your system within the limits imposed by the underlying hardware.
Let’s consider a specific example. I recently talked witha developer who believed he was working on a high- performance system. He told me that his system was working at 10 transactions per second. As of mid-2015, modern processors on commodity hardware can easily perform 2-3 billion instructions per second. So at 2-3 billion instructions per second, the developer was limited to 200- 300 million instructions per transaction ([10 transactions/ sec] / [2-3G instructions/sec] = 200-300m instructions/ transaction) to achieve his goals. Of course, his application isn’t responsible for all of these—the operating system is chewing some cycles too—but 200 million instructions is an awful lot of work. You can write an effective chess player in 672 bytes, as David Horne has shown.
Okay, so maybe the bottleneck was I/O. Perhaps this developer’s application was disk bound. That’s unlikely— modern hard disks found in commodity hardware can transfer data at phenomenal rates. A moderate transfer rate from disk to a buffer in memory is about 10MB/s. If this is the limit, he must be pushing more than 10MB per transaction between disk and memory. You can represent a lot of information in 10 million bytes!
Well, perhaps the problem was the network. Probably not—10Gbit/s networks are now on the low end. A 10Gbit/s network can transmit roughly 1GB of data per second.This means that each of our misguided developer’s 10 transactions per second is occupying 1/10th of a GB, or approximately 100MB—more than 10 times the throughput of our disks!
The hardware wasn’t the problem.
HIGH PERFORMANCE SIMPLICITY I: MODEL THE PROBLEM DOMAIN
One of the great myths about high-performance programming is that high-performance solutions are more complex than “normal” solutions. This is just not true.By definition, a high-performance solution must do the most amount of work in the fewest instructions. Complex solutions precisely don’t last long in high-performance systems because complexity is where performance bottlenecks hide.
The best programmers I know achieve the simplicity demanded by high-performance systems by modeling the problem domain. A software simulation of the problem domain is the best way to come up with an elegant solution.
HIGH PERFORMANCE SIMPLICITY II: SIMPLIFY THE CODE
If you follow the lead of the problem domain, you tend to end up with smaller, simpler classes with clearer relationships. If you follow strong separation of concerns as a guiding principle, then your design will push you in the direction of cleaner, simpler code. The cardinal rule of object-oriented simplicity is “one class, one thing; one method, one thing.”
Modern compilers are extremely effective at optimizing code, but they are best at optimizing simple code. If you write 300-line methods with multiple for-loops, each containing several nested if-conditions throwing exceptions all over the place and returning from multiple points, then the optimizer will simply give up. If you write small, simple, easy to read methods, they are easier to test and easier for the optimizer to understand, which results in significant improvements to performance.
One of the great myths about high- performance programming isthat high-performance solutions are more complex than “normal” solutions. This is just not true.
THE KEY TO HIGH PERFORMANCE: MINIMIZE INSTRUCTIONS, MINIMIZE DATAFundamentally, developing high-performance systemsis simple: minimize the number of instructions being processed and the amount of data being shunted around.
You achieve this by modeling the problem domainand eliminating nonessential complexity. For software developers, “Mechanically Sympathizing” with modern high-performance hardware will help you understand where you’re doing things wrong. It can also point you in the direction of better measurement and profiling, which will help you understand why your code is not performing to its theoretical maximum. So why not try to save your company some money, work in a simplified codebase,and maybe help reduce the carbon footprint of our data centers—all at the same time?