We all know that our fancy new high-performance computing (HPC) systems, with hundreds or even thousands of processors, are designed and built to provide speedier answers to very large computational problems. But who would think that the advances driven by our smartphones and thin laptops would conspire against the assumptions made by HPC platform developers?
Okay, here's what happened. First let's examine the HPC side of the problem. The problems these massively parallel computers work best on involve huge arrays of data and a common algorithm that is applied to each point (and usually some of its near neighbors). This processing results in another huge array that becomes the new input and the process is done again. These sorts of loops can involve trillions of data points (which may be vectors with dozens or even thousands of dimensions) and processing loop counts in the millions. This is exactly what a weather simulation problem looks like and it is the most common style of problem that these giant machines are used for.
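To make the shape of that loop concrete, here is a minimal sketch in Python. It is a toy, not a weather model: the grid is one-dimensional, the "common algorithm" is a three-point average, and the names step and simulate are made up for this illustration.

```python
def step(grid):
    """One sweep: each point becomes the average of itself and its two neighbors."""
    n = len(grid)
    new = [0.0] * n
    for i in range(n):
        # At the edges, reuse the point itself as the missing neighbor (an assumption).
        left = grid[i - 1] if i > 0 else grid[i]
        right = grid[i + 1] if i < n - 1 else grid[i]
        new[i] = (left + grid[i] + right) / 3.0
    return new


def simulate(grid, sweeps):
    """Repeat the sweep; each output array becomes the next input array."""
    for _ in range(sweeps):
        grid = step(grid)
    return grid


if __name__ == "__main__":
    initial = [0.0] * 8
    initial[4] = 9.0
    print(simulate(initial, 3))
```

A real simulation has the same skeleton, just with vastly more points, more dimensions, and a far more expensive update per point.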
Now in order to gain speed, and by that I mean reduce the latency of the result, it seems natural to break the problem data set into chunks and process those chunks on separate processors in ... parallel. Then the system re-stitches all the result chunks back into a huge result array and uses it as the next input. Rinse and repeat. This has worked well ... up until now. There is an underlying assumption in this approach that was perfectly valid when people started building these multiprocessor computers. And the assumption is so simple that even a young child will make decisions based on it. Namely:
Assumption: Equal work executes in equal time.
The child knows instinctively that doing a task (building a tower of five blocks, filling a bucket with sand, etc.) will take a unit of time. And if they do the task twice it will take about two units of time. When it comes to parallelizing the computational work on these huge arrays in the simulation task, the unspoken assumption is that if you break the huge array into a collection of smaller, equal-sized arrays, then each of the smaller arrays can be computed in parallel in approximately the same time. Once they are all computed they are reassembled and the process continues. Minor variations in the time it takes to compute each individual smaller array are inevitable, but since they are usually quite minor, their negative impact is much smaller than the overall positive impact. It is an acceptable engineering trade-off.
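The split, compute, and reassemble pattern can be sketched in a few lines of Python. This is an illustration only: the three-point average stands in for the real algorithm, the names step_chunk and parallel_step are hypothetical, and threads are used merely to show the structure (in standard CPython they will not actually speed up pure-Python arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def step_chunk(grid, start, stop):
    """New values for grid[start:stop]; reads one neighbor beyond each end of the chunk."""
    n = len(grid)
    out = []
    for i in range(start, stop):
        left = grid[i - 1] if i > 0 else grid[i]
        right = grid[i + 1] if i < n - 1 else grid[i]
        out.append((left + grid[i] + right) / 3.0)
    return out


def parallel_step(grid, workers):
    """Split the grid into roughly equal chunks, process them concurrently, re-stitch."""
    n = len(grid)
    size = max(1, -(-n // workers))  # ceiling division, so chunks are (nearly) equal
    bounds = [(s, min(s + size, n)) for s in range(0, n, size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = pool.map(lambda b: step_chunk(grid, *b), bounds)
        # map() yields results in chunk order. Collecting them all is the implicit
        # barrier: the new grid is not complete until the slowest chunk is done.
        return [value for piece in pieces for value in piece]


if __name__ == "__main__":
    print(parallel_step([0.0, 0.0, 9.0, 0.0, 0.0], workers=2))
```

Notice that nothing in this code checks how long each chunk takes. The design simply trusts that equal-sized chunks finish at about the same moment.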
Now let's look at what's happening with battery-operated devices and the overall green/efficiency movement in computation. In the case of our smartphones, tablets, and laptops, it turns out that they are not doing much work most of the time. So, if they can idle at low speed when they're not needed, they save power, which translates to battery life. That's good. And when these devices are presenting information to humans, the difference between a latency of 1 ms and one of 50 ms is hardly noticeable. So, if we can slow down the processor without creating any apparent latency for the human, then we also save power. As a result, a lot of chip development effort has been focused on automatically throttling the CPUs in these devices to minimize power usage without the user noticing.
When it comes to the green/efficiency movement the story is a little different, but it still centers on power consumption. It turns out that our modern workstation CPUs can run so fast that they will overheat and destroy themselves if not restrained. For short bursts they can sprint at their full rated speed, but then they must slow down significantly (or even stop) in order to cool off. These chips have their own on-chip temperature sensors and make their own decisions about when and for how long they need to cool off. This self-regulating technology goes by the name "Dynamic Voltage and Frequency Scaling" (DVFS).
Most of these HPC supercomputers are assembled from large arrays of commodity processors, RAM, disk drives, etc. So, these new smarter and more efficient CPU chips are finding their way into our new supercomputers. At this point you can probably guess what the problem is. The current parallelization of these "grid" types of problems presumes that all of the partial results will come back in a timely fashion. So, even though you've broken the problem into a huge number of smaller problems, the slowest processing time for any one of those smaller arrays governs the loop time for the whole reassembled array.
Let's do a thought experiment: imagine we have 10,000 chunks to process, each on its own processor, and most of them are done in 1 ms, but one of them takes 2 ms. Then the other 9,999 processes will be waiting for that one slow process before the new result array can be completely reassembled. If all of the chunks were processed in 1 ms, then that loop cycle would take 10,000×1 ms (10 seconds) of total processor availability and would complete in 1 ms. If one of the processes took 2 ms and the other processes had to wait for it, then that loop cycle would take 10,000×2 ms (20 seconds) of total processor availability and would complete in 2 ms. In the first case the supercomputer would be 10,000 times faster than a single CPU, and in the second case it would be only 5,000 times faster. Half of the processing potential of the computer would be lost because a single process needed to slow down. Just imagine if it took 4 ms or 5 ms.
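The arithmetic of the thought experiment is easy to play with. The sketch below uses the same simplified model as the text (one chunk per processor, everyone waits for the slowest chunk, no communication overhead); the function name cycle_stats is made up for this illustration.

```python
def cycle_stats(n_chunks, nominal_ms, slowest_ms):
    """Return (wall-clock ms, processor-ms tied up, speedup over one CPU) for one loop cycle.

    Simplified model: every chunk takes nominal_ms except one that takes slowest_ms,
    and no processor can move on until that slowest chunk has finished.
    """
    wall_ms = max(nominal_ms, slowest_ms)   # the slowest chunk sets the cycle time
    reserved_ms = n_chunks * wall_ms        # every processor is held for the full cycle
    serial_ms = n_chunks * nominal_ms       # one unthrottled CPU doing all chunks in turn
    return wall_ms, reserved_ms, serial_ms / wall_ms


if __name__ == "__main__":
    for slowest in (1, 2, 4, 5):
        wall, reserved, speedup = cycle_stats(10_000, 1, slowest)
        print(f"slowest chunk {slowest} ms: cycle takes {wall} ms, "
              f"{reserved / 1000:.0f} s of processor time tied up, "
              f"speedup {speedup:.0f}x")
```

Under this model a single 4 ms straggler cuts the speedup to 2,500 and a 5 ms straggler cuts it to 2,000, out of a possible 10,000.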
Before everyone blasts me with comments, I know that I have simplified the description of the processing costs by ignoring the marshaling overhead and the details of data set decomposition (usually octrees for weather simulations). But the problem is quite real. Until recently most CPU chips could be set up to disable the newfangled DVFS. And most supercomputer builders did just that to solve the problem. Because these chips will overheat if they run at sprint speed all the time, the designers' only option is to reduce the clock frequency to a safe speed and boost the cooling efficiency of the physical chips. And so these new HPC computers gain predictability in processing latency at the expense of attaining their maximum processing speed.
Some of the newer HPC development efforts are experimenting with what are essentially CPUs designed originally for the mobile environment. They are very efficient and inexpensive and have useful computational power, but they are also more autonomous in deciding how to maintain their efficiency. Therein lies the rub.
Probably the very best solution to this problem is to develop smarter parallel programs. Parallelization is, after all, in its infancy.
Conclusion: Better parallel programs are needed. So get to work (you know who you are)!