Google’s chip designers argue that if Moore’s Law is no longer sustainable, domain-specific architectures are the way of the future. Instead of developing generic hardware for ML (machine learning), Google designed the Tensor Processing Unit (TPU) as an ASIC (application-specific integrated circuit) accelerator for AI. The TPU’s objective is to optimize the operations that matter most to its problem domain, and at the top of the list is the deep neural network (DNN). Here are some other DL domains targeted by the TPU:
The TPU introduces 128×128 16-bit matrix multiply units (MXUs) to accelerate the matrix multiplication at the heart of ML. PageRank, used in Google search ranking, also involves huge matrix multiplications. Google therefore uses TPUs for many internal workloads that rely heavily on matrix multiplication, including inference in AlphaGo and PageRank.
With a much narrower problem domain, many capabilities become non-critical and are eliminated for simplicity. Nvidia GPUs drop the branch prediction and speculative execution found in CPUs to free up valuable die space. The TPU simplifies instruction handling further: it depends on the host server to send it instructions to execute rather than fetching them itself. To speed up development and reduce project risk, Google engineers designed the TPU as a coprocessor on the I/O bus that plugs into existing servers just like a GPU card.
The diagram below shows the TPU v3 device. It contains 4 chips, each with 2 cores and 32 GB of HBM memory. Each core contains vector processing units (VPU), scalar units, and two 128 × 128 matrix multiply units (MXUs) (TPU v1 uses a 256×256 unit).
Each MXU performs 16K multiply-accumulate operations per cycle, using BF16 precision (a 16-bit format that covers the range of FP32 at lower precision) for the internal multiplication and FP32 for accumulation.
ML code is usually written in FP32, but the TPU converts it to BF16 automatically. The MXU executes the fully-connected layers and the convolution layers in the DNN. The VPU runs vector operations, including activation functions such as ReLU, softmax, batch normalization, dropout, pooling, and gradient updates, using INT32 and FP32. The scalar unit handles TPU control, data flow, and memory addressing.
The MXU pipelines its operations through a systolic array. Here is a high-level animation of the systolic array operation.
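To make the BF16 format concrete, here is a minimal Python sketch. It assumes BF16 is the upper 16 bits of an IEEE FP32 value (same sign bit and 8-bit exponent, mantissa cut from 23 bits to 7). The helper name `to_bf16` is hypothetical, and the round-to-nearest-even step is one common convention, not a statement about what the TPU hardware does. NaN and infinity handling is left out.

```python
import struct

def to_bf16(value: float) -> float:
    """Round a float to the nearest bfloat16 value, returned as a Python float."""
    # Reinterpret the FP32 encoding of `value` as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Round to nearest, ties to even, on the 16 low bits about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep the sign bit, the 8 exponent bits, and the top 7 mantissa bits.
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# One multiply-accumulate per cell of a 128 x 128 MXU gives the "16K" per cycle.
MACS_PER_CYCLE = 128 * 128

print(MACS_PER_CYCLE)        # 16384
print(to_bf16(3.14159))      # 3.140625
```

The second print shows the cost of the shorter mantissa: only about two to three decimal digits survive, while very large FP32 values such as `1e38` remain representable because the exponent width is unchanged.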
Here is another closer look at the TPU device with 4 chips.
TPU devices are not available for public purchase, but the general public can access TPUs through Google Cloud. Google Cloud offers different configuration options for a single TPU device (such as v2-8 and v3-8, where 8 is the number of cores).
Multiple TPU devices can also be configured with high-speed connections to form a TPU pod. A v3 pod can have a maximum of 256 devices with 2048 TPU v3 cores and 32 TiB of memory.
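The pod figures follow from the per-device numbers. The short Python check below is only arithmetic on the values quoted in this article, with one labeled assumption: that each v3 chip carries 32 GiB of HBM.

```python
CHIPS_PER_DEVICE = 4
CORES_PER_CHIP = 2
HBM_GIB_PER_CHIP = 32      # assumption: 32 GiB of HBM per TPU v3 chip
DEVICES_PER_POD = 256

cores_per_device = CHIPS_PER_DEVICE * CORES_PER_CHIP
pod_cores = DEVICES_PER_POD * cores_per_device
pod_hbm_tib = DEVICES_PER_POD * CHIPS_PER_DEVICE * HBM_GIB_PER_CHIP / 1024

print(cores_per_device)    # 8, the "8" in v3-8
print(pod_cores)           # 2048
print(pod_hbm_tib)         # 32.0
```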
Many production-quality deep learning (DL) models are trained on a cluster of devices. With data parallelism, each node is responsible for a subset of the data. To scale linearly, very fast inter-node communication is a must, so TPU devices are connected with a 2-D toroidal mesh network.
This high-speed network shares parameter (weight) updates across all TPU devices through the All Reduce operation: every worker trains on a subset of the input data, and the gradients are aggregated at each step.
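The data-parallel pattern described above can be sketched in a few lines of plain Python. This is a toy model, not TPU code: the "workers" are list entries, `all_reduce_mean` is a hypothetical stand-in for the network collective, and the one-parameter linear model and learning rate are made up for illustration.

```python
def local_gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Toy all-reduce: every worker receives the mean of all workers' values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

# Four workers, each holding an equally sized shard of data drawn from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
num_workers = 4
shards = [data[i::num_workers] for i in range(num_workers)]
weights = [0.0] * num_workers     # every worker starts from the same weight
lr = 0.01

for step in range(200):
    # Each worker computes a gradient on its own shard only.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # The gradients are aggregated so every worker sees the same value.
    grads = all_reduce_mean(grads)
    # Every worker applies the identical update, so replicas stay in sync.
    weights = [w - lr * g for w, g in zip(weights, grads)]

print(weights[0])
```

Because every worker applies the same aggregated gradient, the replicas never drift apart, and the shared weight converges toward 3.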
The diagram on the right below shows a more advanced weight update method.
It distributes the weight update computation across the TPU v3 cores and then uses an optimized all-gather to broadcast the new weights to all of them.
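One way to read this scheme is as a reduce-scatter, a sharded update, and an all-gather. The Python sketch below illustrates that reading under stated assumptions: the function names, the two-core setup, and the plain SGD update are all hypothetical, and nothing here reflects the actual TPU implementation.

```python
def reduce_scatter(per_core_grads, num_cores):
    """Core k ends up with the summed gradient for weight slice k only."""
    n = len(per_core_grads[0])
    size = n // num_cores            # assumes n divides evenly by num_cores
    return [
        [sum(g[i] for g in per_core_grads) for i in range(k * size, (k + 1) * size)]
        for k in range(num_cores)
    ]

def all_gather(slices):
    """Every core receives the concatenation of all cores' slices."""
    full = [v for s in slices for v in s]
    return [list(full) for _ in slices]

num_cores, lr = 2, 0.1
weights = [1.0, 2.0, 3.0, 4.0]                 # replicated on every core
per_core_grads = [[1.0, 1.0, 1.0, 1.0],        # gradient from core 0's data
                  [3.0, 1.0, -1.0, 1.0]]       # gradient from core 1's data

grad_slices = reduce_scatter(per_core_grads, num_cores)
size = len(weights) // num_cores

# Each core updates only its own slice of the weights (averaged gradient).
new_slices = [
    [w - lr * g / num_cores
     for w, g in zip(weights[k * size:(k + 1) * size], grad_slices[k])]
    for k in range(num_cores)
]

# The updated slices are gathered so every core holds the full new weights.
replicas = all_gather(new_slices)
print(replicas[0])
```

Compared with having every core apply the full update after an all-reduce, each core here does only 1/`num_cores` of the update arithmetic.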
TPU implements the matrix multiplication with the systolic array in a pipeline fashion.
First, we load the weights w of the model into the array. The activations x are skewed and slide down one row at a time. We initialize the partial sum of each cell in the array to zero. In each cycle, we multiply the cell’s weight by the value of x that has slid down into it and add the product to the corresponding partial sum. Then we shift all the partial sums one cell to the right. We repeat these steps until all the values are computed. The diagram below demonstrates the idea visually.
As shown, the TPU processes matrix multiplication as a data stream, with no need to store intermediate results in off-chip DRAM. Because we know the data access pattern of the target domain, we can keep intermediate calculations in local memory. This avoids the complex caching otherwise needed to optimize off-chip memory accesses. It is a good example of how better use of domain knowledge lets us drop optimizations, such as caching, that increase die size.
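The steps above can be simulated cycle by cycle in plain Python. This is a sketch of the dataflow as described in this article (weights stay in place, activations move down, partial sums move right), not a model of the real hardware. The function name `systolic_matmul` and the register layout are assumptions made for illustration.

```python
def systolic_matmul(W, X):
    """Return Y with Y[b][i] = sum_j W[i][j] * X[b][j], via a simulated array."""
    rows, cols, batch = len(W), len(W[0]), len(X)
    act = [[0] * cols for _ in range(rows)]    # activation held in each cell
    psum = [[0] * cols for _ in range(rows)]   # partial sum leaving each cell
    Y = [[0] * rows for _ in range(batch)]

    for t in range(batch + rows + cols - 2):   # cycles until the array drains
        new_act = [[0] * cols for _ in range(rows)]
        new_psum = [[0] * cols for _ in range(rows)]
        for i in range(rows):
            for j in range(cols):
                if i == 0:
                    # Skewed input: column j receives vector b at cycle b + j.
                    b = t - j
                    a = X[b][j] if 0 <= b < batch else 0
                else:
                    # Activations slide down one row per cycle.
                    a = act[i - 1][j]
                # Partial sums arrive from the left neighbour (zero at the edge).
                p_in = psum[i][j - 1] if j > 0 else 0
                new_act[i][j] = a
                new_psum[i][j] = p_in + W[i][j] * a
        act, psum = new_act, new_psum
        # Finished sums leave the right edge of row i at cycle b + i + cols - 1.
        for i in range(rows):
            b = t - i - (cols - 1)
            if 0 <= b < batch:
                Y[b][i] = psum[i][cols - 1]
    return Y

W = [[1, 2], [3, 4]]
X = [[5, 6], [7, 8]]           # two input vectors streamed back to back
print(systolic_matmul(W, X))   # [[17, 39], [23, 53]]
```

Each cell only ever talks to its neighbours, and a new input vector can enter every cycle, which is why no intermediate result needs to leave the array.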
TPU system design
The TPU runs the whole inference model to reduce interactions with the host CPU. The following is the system diagram for the TPU device.
The description below is quoted from Norman Jouppi on TPUv1.
The TPU instructions are sent from the host over the peripheral component interconnect express (PCIe) Gen3 x16 bus into an instruction buffer. The internal blocks are typically connected together by 256-byte-wide paths. Starting in the upper-right corner, the matrix multiply unit is the heart of the TPU, with 256×256 MACs that can perform eight-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the four megabytes of 32-bit Accumulators below the matrix unit. The four MiB represents 4,096, 256-element, 32-bit accumulators. The matrix unit produces one 256-element partial sum per cycle.
The weights for the matrix unit are staged through an on-chip “Weight FIFO” that reads from an off-chip eight-gigabyte DRAM we call “weight memory”; for inference, weights are read-only; eight gigabytes supports many simultaneously active models. The weight FIFO is four tiles deep. The intermediate results are held in the 24 MiB on-chip “unified buffer” that can serve as inputs to the Matrix Unit. A programmable DMA controller transfers data to or from CPU Host memory and the Unified Buffer. To be able to deploy dependably at Google scale, internal and external memory include built-in error-detection-and-correction hardware.
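The sizes in the quoted description can be checked with a little arithmetic. The Python snippet below uses only numbers from the quote, plus one labeled assumption: that a weight "tile" is a full 256×256 block of one-byte weights.

```python
MXU_DIM = 256

macs_per_cycle = MXU_DIM * MXU_DIM        # 8-bit multiply-and-adds per cycle

# 4,096 accumulators x 256 elements x 4 bytes (32 bits) each.
acc_bytes = 4096 * 256 * 4
acc_mib = acc_bytes / 2**20

# Assumption: one weight tile = 256 x 256 weights at 1 byte each.
tile_kib = MXU_DIM * MXU_DIM * 1 / 2**10
fifo_kib = 4 * tile_kib                   # the weight FIFO is four tiles deep

print(macs_per_cycle)   # 65536
print(acc_mib)          # 4.0
print(tile_kib)         # 64.0
print(fifo_kib)         # 256.0
```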
As shown below, most of the TPU’s die area is devoted to arithmetic. This is not the case for a CPU. Researchers often use this type of analysis to illustrate the overhead of a design.
More detailed hardware design decisions can be found in section 2 here (they are too detailed for our discussion).
The next version, not yet released, is TPU v4. For the MLPerf v0.7 results, Google benchmarked both the v3 and v4 TPUs. According to Google,
Google’s fourth-generation TPU ASIC offers more than double the matrix multiplication TFLOPs of TPU v3, a significant boost in memory bandwidth, and advances in interconnect technology. Google’s TPU v4 MLPerf submissions take advantage of these new hardware features with complementary compiler and modeling advances. The results demonstrate an average improvement of 2.7 times over TPU v3 performance at a similar scale in the last MLPerf Training competition.
In short, TPU v4 more than doubles the matrix multiplication TFLOPs of v3, significantly boosts memory bandwidth, and improves the interconnect technology.
Google takes on some risk if DL algorithms change. But the cost savings for Google’s datacenters can be huge, so that risk may not matter much, even in the short run. Later in this series, we will look into startups that disagree with this direction.
Here is an article on GPU if you are interested in other approaches.