TPUs for Everyone

You may have thought GPGPUs were the fastest way to do machine learning. It may be time to fasten your seatbelt: TPUs run in an even faster lane.

A lot of people are using TensorFlow these days. Some of us are experimenting at the entry level of computational power, and others are dangling dangerously over the bleeding edge. I recently wrote an article about how Intel has sped up the Python scientific libraries to provide better performance for very small projects (and significant, incremental speedups as they grow in the cloud). For the folks at the bleeding edge, there are some very interesting new toys to get excited about. And now (as of a few days ago) you can play with them on the Google Cloud Platform (GCP) for about $6.50 per hour, with per-second billing!

Google has been doing R&D on specialized hardware to accelerate and scale up some of the specific math operations that are central to TensorFlow. Enter the Tensor Processing Unit (TPU). The TPU is a piece of hardware, a roughly smartphone-sized circuit board, built with custom application-specific integrated circuits (ASICs) designed to perform a few of the core, highly repetitive computational tasks that TensorFlow is architected around.

TPUs are following an evolutionary path similar to that of the GPU, which has its roots in a separate processing board in the SGI graphics workstations of the 1980s. Before then, all of the computational work for rendering shaded three-dimensional images was done on the CPU, and as you can imagine, this rapidly became a bottleneck. SGI's solution was to offload the single most computationally intense and repetitive function: computing the RGB pixel values for individual triangular facets based on their defined color, the virtual incident light sources, and the viewing perspective. The triangles are individual facets of a three-dimensional mesh representing the surface of a 3D object. Even back in the 1980s, it was not unusual to have tens of thousands of facets in a high-quality 3D rendering.

SGI tackled the chore of rendering 3D images with a piece of hardware it called a "geometry engine." The sole purpose of these initial geometry engines was to compute the shade of color to "paint" on each pixel of one tiny triangle. They resembled very small CPUs that could do dot and cross products on a small amount of isolated data (vectors for the three vertices of the triangle, vectors for the incident light, a vector for the viewing angle, and vectors for the colors of the lights).
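To make the per-facet arithmetic concrete, here is a minimal sketch of flat shading for a single triangle using one cross product and one dot product. It assumes a simple Lambertian (diffuse) lighting model, which is an illustrative choice and not a description of SGI's actual hardware algorithm; the function name `shade_facet` and its parameters are hypothetical. Diffuse shading does not depend on the viewing angle, so that vector is left out of the sketch.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(a / length for a in v)

def shade_facet(v0, v1, v2, to_light, base_rgb, light_rgb):
    """Return one flat RGB color for a triangular facet (assumed Lambertian model)."""
    # The facet normal comes from the cross product of two edge vectors.
    normal = normalize(cross(sub(v1, v0), sub(v2, v0)))
    # Brightness is the cosine between the normal and the light direction,
    # clamped to zero when the facet faces away from the light.
    intensity = max(0.0, dot(normal, normalize(to_light)))
    return tuple(round(c * l * intensity) for c, l in zip(base_rgb, light_rgb))

# A triangle in the xy-plane, wound counter-clockwise so that it faces +z.
facet = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
rgb = shade_facet(*facet, to_light=(0.0, 0.0, 1.0),
                  base_rgb=(200, 100, 50), light_rgb=(1.0, 1.0, 1.0))
print(rgb)  # (200, 100, 50): the light hits the facet head-on
```

Every facet needs only its own handful of vectors, which is why the work could be handed to many small, independent processing units.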

In fact, the computing cores of these early graphics-oriented boards were often referred to as "shaders," and having a mere dozen of them was considered a very big deal. We have certainly come a long way since then. GPGPUs today have thousands of cores running at gigahertz rates, with gigabytes of onboard storage. Today it looks like machine learning is about to get a significant hardware boost of its own.

Enter the TPU. Machine learning relies on a large number of simple operations on relatively small sets of localized data (which, of course, are subsets of a vast corpus of input data). Google did not fail to recognize this divide-and-conquer nature of the TensorFlow methodology. Their approach, as you might expect, was to take a powerful, modular approach to this computational bottleneck (opportunity?). True, this bottleneck is currently addressed by GPGPUs, and they do a pretty good job. But the GPGPU was advanced and refined to do a variety of graphics things. Google decided to refocus the problem and make hardware that did "machine learning things." In the interest of speed, they decided to remove the "General Purpose Graphics" from the General Purpose Graphics Processing Unit.

While GPUs strongly resemble CPUs, only smaller and highly parallelized, the TPU was created to make a more direct attack on the problem. It made sense to design the TPU to be less programmable by using ASICs, which have their hardware programming built into the silicon. In the first incarnation of this new computing paradigm, the basic TPU module is built around four custom ASICs, which together can deliver 180 teraflops of floating-point math with up to 64 GB of high-bandwidth RAM on a moderately sized (fits in your hand) board.
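The "simple operations on small, localized data" idea can be sketched in plain Python as a matrix multiplication that is broken into small tiles, where the innermost work is nothing but multiply-and-accumulate. This is only an illustration of the divide-and-conquer pattern; the function name `matmul_tiled` and the tile size are assumptions made here, and the sketch does not claim to mirror how the TPU hardware actually schedules its work.

```python
def matmul_tiled(a, b, tile=2):
    """Multiply a (n x k) by b (k x m), processing one small tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # Each tile touches only a small, local block of the inputs,
                # and the only operation performed is multiply-accumulate.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = out[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        out[i][j] = acc
    return out

a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
product = matmul_tiled(a, b)
print(product)  # [[58.0, 64.0], [139.0, 154.0]]
```

Because the tiles are independent of one another apart from the final accumulation, this kind of workload maps naturally onto hardware that does one narrow job many times in parallel.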

Figure: Block diagram of the Google TPU.

I don't have the space here to describe the inner workings of the TPU, but Google has provided a very readable explanation of what goes on in the block diagram above. It's well worth the 10 or 20 minutes you'll invest in pondering this device. Read it here.

Remember, this is just a beta. But there is plenty of documentation, including tutorials and quickstart guides, and it's not overly complex to use. As an example, a simple multiplication and addition can be run on a TPU by setting an environment variable:

  export TPU_NAME="demo-tpu"

And creating a short Python script:

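# Requires TensorFlow 1.x; tf.contrib was removed in TensorFlow 2.
import os
import tensorflow as tf
from tensorflow.contrib import tpu
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

def axy_computation(a, x, y):
  # Element-wise a * x + y; this is the computation that runs on the TPU.
  return a * x + y

# One input per argument of axy_computation.
inputs = [
    3.0,  # a: scalar multiplier (illustrative value)
    tf.ones([3, 3], tf.float32),  # x
    tf.ones([3, 3], tf.float32),  # y
]

# Rewrite the function so that it executes on the TPU.
tpu_computation = tpu.rewrite(axy_computation, inputs)

# Look up the TPU's gRPC address from the TPU_NAME environment variable.
tpu_grpc_url = TPUClusterResolver(
    tpu=[os.environ['TPU_NAME']]).get_master()

with tf.Session(tpu_grpc_url) as sess:
  sess.run(tpu.initialize_system())
  output = sess.run(tpu_computation)
  print(output)
  sess.run(tpu.shutdown_system())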
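If you want to sanity-check what comes back from the TPU, a plain-Python reference for the same a * x + y computation is easy to write. The sketch below assumes a scalar `a` of 3.0 (a value picked here purely for illustration) and 3x3 matrices of ones for `x` and `y`; the helper name `axy_reference` is hypothetical.

```python
def axy_reference(a, x, y):
    """Plain-Python a * x + y, applied element by element to nested lists."""
    return [[a * xi + yi for xi, yi in zip(x_row, y_row)]
            for x_row, y_row in zip(x, y)]

# 3x3 matrices of ones, matching the shapes passed to the TPU script.
ones = [[1.0] * 3 for _ in range(3)]
expected = axy_reference(3.0, ones, ones)
print(expected)  # three rows of [4.0, 4.0, 4.0]
```

With those inputs, every element of the result should be 3.0 * 1.0 + 1.0 = 4.0, which gives you a known value to compare against the TPU output.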


I think TPUs look like a solid bet for the direction of machine learning over the next year or two. And for about the price of a decent coffee, just think what you can do with 180 teraflops for an hour! You can sign up right now.

