Neural Nets Inside the CPU: Self-Awareness Breeds Speed

DZone 's Guide to

Neural Nets Inside the CPU: Self-Awareness Breeds Speed

There are different ways to speed up a CPU. We can increase the clock speed, but we've just about reached the practical limits on that front. We can add more cores, but we're still not very good at parallelizing our algorithms. On another front, optimizing code branching, we can still make some solid improvements.

· Performance Zone ·
Free Resource

Everyone knows that the CPU is the focal point of every computer. In the classical Von Neumann architecture the CPU is usually the bottleneck that throttles our dreams of speed. True, most of our new CPUs have 2, 4, or even a couple of dozen "cores" but even so each one of those is burdened with the problem of branching in the program. Whether it's an if-then statement or a case statement the CPU must test the possible paths in order to decide which one to follow. Since the CPU is inherently a serial device that can only evaluate one possibility at a time before the program can advance. With modern compilers, which try to help us by doing things such as pre-computing parts of the test ahead of time if possible (e.g. the addition of static values, or do very easy tests first), you can't even be sure which order the branches will be tested.

Note: in the early days of compilers for languages such as Fortran the tests were always done in a first-come-first-served fashion just as they were written. So if you knew that one branch would happen 90+% of the time you could write your code with that branch first and get a noticeable speed up.

So, today's compilers can on average provide some improvement, but not as much as if you could predict most if-then tests at better than 50% odds. Here's where the neural learning approach helps.

Sometimes when we're writing a program we have a sense of the probabilities at each branch point, but usually we don't have a simple model for that information. That doesn't mean it couldn't be gleaned by studying the problem and doing a little side research but that usually is not cost-effective unless it is part of some hugely time-sensitive, mission-critical code (e.g. embedded logic for a hydraulic gimbal actuator for a rocket engine where it's important to know exactly how long the computation will take).

What if we considered the state of the CPU as a set of input parameters and consider the branch results as the desired output. Then we can approach this problem as a standard machine learning task. Without having to understand how to model the specific nature of the branch we may be able to predict our branch at better than even odds. If it seems a little improbable that looking at bits distributed over the CPU could tell you much about an upcoming branch, then think about how deep learning can distinguish between kittens and giraffes. These systems are not given models of mammalian structure on which to base their decisions. These systems are given pixels and a label "kitten" or "giraffe". (And of course a vast number of input examples!)

Not to keep you in suspense, it turns out that there are patterns that can be learned. And with a little extra circuitry on the CPU chip combined with the previously learned patterns, it is possible to make those branch predictions quickly enough to be used advantageously during computations.

Dr. Daniel Jiménez, professor in the Department of Computer Science and Engineering In the Texas Architecture and Compiler Optimization (TACO) lab at Texas A&M University, has devoted much of his time to this particular problem and he has revolutionized this technology. Dr. Jiménez's prediction mechanisms and circuitry are among the most accurate available today. And they're not just theoretical, they are seeing use in Advanced Micro Devices (AMD) and other CPUs today. He has received the NSF CAREER grant for this work.

It turns out that this type of prediction is not limited to branching. It is also very useful for memory cache management. The CPU on-chip cache memory operates at much faster read/write speeds than the DRAM chips on the motherboard. Most of us are aware that the DRAM memory is copied in small blocks to and from the L1 and L2 memory caches. But, this only speeds up the program if the innermost (and fastest) ring of cache memory holds the data that the program is requesting. It is a prediction problem: what data is pushed back to DRAM and what data is pulled up from the RAM. (Memory cache algorithms always remind me of that song by The Clash.) 

Jiménez's lab has produced improved dead block predictors and other cache management policies using techniques similar to the branching problem. Their most recent paper, “Minimal Disturbance Placement and Promotion,” at the 2016 HPCA conference introduced a policy that uses minimal hardware overhead to manage the cache and which is competitive with state-of-the-art policies which have larger overheads.

The 2009 issue of IEEE Micro’s "Top Picks from Computer Architecture Conferences" highlighted this work which was done in collaboration with Dr. Doug Burger and Dr. Renée St. Amant from the University of Texas at Austin.

compiler, cpu, memory cache

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}