AI chips: FPGA

Nov 30, 2020

A CPU provides a generic set of instructions for general-purpose computing. To modify or optimize an application, we change the code while the hardware stays fixed. This generality comes at the cost of hardware complexity: without complex hardware optimizations such as speculative execution, performance suffers, but these optimizations increase die size and power consumption.

Generality provides flexibility at the cost of complexity

To increase concurrency in Deep Learning (DL), some chip designers limit the chip's functionality to a vertical (domain-specific) set of instructions and implement it as an ASIC (application-specific integrated circuit). That is the approach used by Google's TPU. However, developing an ASIC is expensive and impractical if the design requirements change constantly.

An FPGA provides a middle ground between generic processors (like CPUs) and ASICs. Designers can program an FPGA chip with their own hardware design, and yet the design can be changed or enhanced easily by reprogramming the FPGA. Let’s start with a technical overview for those not familiar with FPGAs.

What is FPGA? (Optional)

For decades, the semiconductor industry has had a middle-ground solution that allows configurable hardware design (generic components versus customized ASIC designs). That is the FPGA (Field-Programmable Gate Array). It is called “field-programmable” because we can easily reprogram the chip for different hardware designs. We can view an FPGA as a set of blocks like Lego. By putting the Lego blocks together differently (programming), we create different toys (hardware designs). Let’s look at a DL example to explain the high-level application in AI.

Many DL models, like the fully-connected deep neural network (DNN) above, can be treated as computation graphs. The nodes represent computations and the edges indicate the data flow. To model this graph, our hardware design should be composed of computation blocks that mimic the calculations in these nodes, and then we link the data flow from one block to another.
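To make the graph view concrete, here is a minimal Python sketch of a tiny fully-connected network expressed as nodes (computations) and edges (inputs). The weights, node names, and the `run_graph` helper are made up for illustration and do not come from any toolkit.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

# Illustrative weights for a 2-2-1 network (hypothetical values).
W1 = [[1.0, -1.0], [0.5, 0.5]]
W2 = [[1.0, 2.0]]

# Each node maps to (operation, names of the nodes feeding it).
# The input lists are the edges of the computation graph.
graph = {
    "fc1": (lambda x: matvec(W1, x), ["input"]),
    "relu1": (relu, ["fc1"]),
    "fc2": (lambda h: matvec(W2, h), ["relu1"]),
}

def run_graph(graph, order, inputs):
    """Evaluate the nodes in the given (topological) order."""
    values = dict(inputs)
    for name in order:
        op, deps = graph[name]
        values[name] = op(*(values[d] for d in deps))
    return values

values = run_graph(graph, ["fc1", "relu1", "fc2"], {"input": [2.0, 3.0]})
print(values["fc2"])
```

In hardware, each node would become a block and each edge a configured connection between blocks, rather than a dictionary lookup.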

FPGAs are semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. CLBs are highly configurable to create different logic. With Programmable Interconnect, we can create a complex data path for those CLBs.

CLB

The diagram below is the high-level block diagram for the adaptive logic modules (ALM) contained in the logic array block (LAB) in Intel Stratix 10 (LAB — a.k.a. configurable logic block CLB).

A CLB uses configurable look-up tables (LUTs) (on the left above) to implement a logic function f(a, b, c, …), and we can configure a LUT to mimic any logic function of its inputs.
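As a rough software analogy, a LUT is just a truth table indexed by its input bits. The sketch below "programs" the same 3-input structure with two different functions; `make_lut` and `lut_eval` are hypothetical helper names, not vendor APIs.

```python
from itertools import product

def make_lut(func, num_inputs):
    """Precompute the truth table of `func` as a list of 2**num_inputs bits."""
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=num_inputs)]

def lut_eval(table, *bits):
    """Look up the output: the input bits form the address into the table."""
    address = 0
    for bit in bits:
        address = (address << 1) | bit
    return table[address]

# Same LUT structure, two different configurations.
majority = make_lut(lambda a, b, c: (a + b + c) >= 2, 3)
parity = make_lut(lambda a, b, c: a ^ b ^ c, 3)

print(majority)                  # truth table contents
print(lut_eval(parity, 1, 0, 1))
```

Changing the function only changes the table contents, which is the sense in which the logic is configurable without changing the circuit.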

Inside the CLB, the LUTs are often followed by adders with outputs registered (outputs store their previous values until a new clock is applied).

For illustration, this is another CLB example from the Xilinx 7 series FPGA. In the right diagram, the input side has two modules which take four inputs each and output a logical value. Then, further combinational logic is applied to create logical functions that support more than four inputs with multiple outputs.

FPGA in DL

To further enhance its capability, other blocks such as memory blocks, multipliers, embedded processors, and DSP blocks can be added to an FPGA. These blocks can be segmented and connected through the vertical and horizontal lines below.

Here is the Intel Stratix 10 Variable Precision DSP Block. A DNN implementation maps many of its arithmetic functions onto these DSP blocks.

The DSP blocks are also configurable to support multiple features.

FPGA selling points

There are many other blocks that FPGA may offer depending on vendors and product lines.

An FPGA may contain millions of logic elements, thousands of memory blocks, and thousands of DSP blocks. These memory blocks can offer >50TB/s of on-chip SRAM bandwidth.

This array of blocks can be grouped, segmented, and connected with programmable connections. These blocks and interconnects are highly configurable to create the parallelism and the computation power for highly customized designs. High-speed memory blocks can also be grouped into different sizes and serve specific size/cache needs for a specific set of nodes and features.

In addition, an FPGA is designed to receive and transmit signals fast. The emphasis on high-speed transceivers (with many I/O blocks) is another selling point for handling visual and audio streaming data. An FPGA can be reprogrammed in the 20 ms range with a new FPGA bitstream file (subject to the FPGA model). Product upgrades are no longer restricted to software upgrades: hardware design upgrades can be done with a new bitstream file. Such a customized design usually consumes less power, typically 38 to 42 W in the Intel Vision Accelerator Design with Arria 10 FPGA versus the 250 W range in the Nvidia V100 GPU (this is just for illustration, as the two are very different devices). System latency can also be shortened with customized hardware. This is probably why Intel has pitched its FPGAs for AI inference.

Programmable Interconnects

These FPGA blocks can be connected together through vertical and horizontal lines.

These connections are made through highly configurable Programmable Switch Matrices (PSM) that we can preprogram.

This is a zoomed view of possible interconnections for a small segment of the FPGA chip.

This is another demonstration of how blocks may be connected.

Intel Vision Accelerator Design with Arria 10 FPGA (For AI inferencing)

The Intel Vision Accelerator with Arria 10 FPGA can connect to a video server or NVR (network video recorders) to support more than 20 channels of video inputs for analysis, processing, or other AI applications including facial recognition and detection.

For example, we can use the accelerator to analyze raw video streams for person detection and tracking from many video sources (potential use in stores like Amazon Go).

Here is the design for another FPGA acceleration card (Programmable Acceleration Card) using Arria 10 FPGA.

These accelerator cards can perform many AI functions. OpenVINO (discussed later) releases many AI models for performing the following functions:

Here is the high-level specification for the accelerator card and the Arria 10 if you are interested.

For decades, FPGAs have been programmed in a hardware description language (HDL). It speaks the language of registers, clocks, etc.

Source Intel (Programming FPGA with HDL)

With the tools provided by the vendors, the HDL source code is transformed into circuits. Then a place-and-route algorithm maps them onto blocks and interconnections while minimizing latency. This compilation process can take hours or days to finish. Once it is done, a bitstream file is created. This file is used to program (configure) the FPGA.

However, this is not the way we design or deploy DNN models on the Intel FPGAs. Toolkits are used to perform a software deployment, instead of reprogramming the FPGA. The FPGA should be already programmed.

OpenVINO

OpenVINO (Open Visual Inference and Neural network Optimization) provides toolkits and libraries for Intel devices (CPU, integrated GPU, FPGA, etc.) in the following areas.

It can be used to deploy a DL model to analyze a video stream using the FPGA.

Intel’s distribution of OpenVINO contains:

Deep Learning Deployment Toolkit (DLDT) for optimizing and deploying ML models,
FPGA Deep Learning Acceleration (DLA) and bitstreams for DL,
Tools, libraries, and accelerators for OpenCV, OpenVX, OpenCL and Intel Media SDK,
Plugins for Intel devices, and
A model zoo with pre-trained models for DLDT.

DL engineers design and train models in software platforms like TensorFlow, PyTorch, etc… Ultimately, we save trained models in files (.pb in TensorFlow).

Then, we use the Deep Learning Deployment Toolkit (DLDT) in OpenVINO to deploy the DL model. DLDT contains two major components:

Model optimizer and
Inference engine.

Specifically, DLDT reads the model files, performs optimization, and deploys the model on Intel devices. To offer a single platform solution for all Intel devices, it reads the trained model file from a popular DL platform (like TensorFlow), applies device-agnostic optimizations, and converts the model into a device-agnostic intermediate representation (IR).

The following two sections are optional and demonstrate some computation graph optimizations.

Linear Operations Fusing (advanced topic, optional)

Many CNN models, including ResNet and Inception, contain batch normalization and scale-shift layers (scale and then offset the input) that can be viewed as sequences of multiplications and additions. These sequences can be merged into a single multiply-and-add operation, i.e. ×, +, ×, +, ×, + → ×, +. The result can then be fused into the next convolution layer or fully connected layer if present (details).

**ResNet269 block before and after optimization**
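The fusing idea can be sketched in a few lines of Python. This is only an illustration of the algebra, assuming each step is a per-channel scale and shift; the function names are hypothetical and this is not the Model Optimizer's actual implementation.

```python
def fuse_affine(ops):
    """Collapse a chain of (scale, shift) steps into one (scale, shift)."""
    scale, shift = 1.0, 0.0
    for a, b in ops:
        # a * (scale * x + shift) + b == (a * scale) * x + (a * shift + b)
        scale, shift = a * scale, a * shift + b
    return scale, shift

def fold_into_next_layer(weights, bias, scale, shift):
    """Fold y = scale * x + shift (per input channel) into a following
    linear layer W y + b, so that only W' x + b' remains."""
    new_w = [[w * s for w, s in zip(row, scale)] for row in weights]
    new_b = [b + sum(w * t for w, t in zip(row, shift))
             for row, b in zip(weights, bias)]
    return new_w, new_b

# Three multiply-add steps become one.
fused_scale, fused_shift = fuse_affine([(2.0, 1.0), (0.5, -3.0), (4.0, 0.25)])

# Then the single multiply-add disappears into the next layer's weights.
new_w, new_b = fold_into_next_layer(
    weights=[[1.0, 2.0]], bias=[0.5],
    scale=[4.0, 2.0], shift=[-9.75, 1.0],
)
print(fused_scale, fused_shift, new_w, new_b)
```

After folding, inference runs the next layer alone with the adjusted weights and bias, so the normalization costs nothing at run time.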

Stride optimization in ResNet (advanced topic, optional)

The main idea here is to move a stride greater than 1 from a later convolution layer (with kernel size = 1) to the convolution layers above it. In addition, a pooling layer is added for the skip connection.
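A heavily simplified sketch of why this is safe: a 1 × 1 convolution with stride 2 only ever reads every other pixel, so the layer before it can skip computing the pixels that would be thrown away. The example below assumes single-channel 1 × 1 convolutions (plain scalings) and ignores the skip connection; it is an illustration, not the toolkit's transformation.

```python
def conv1x1(image, weight, stride=1):
    """Single-channel 1x1 convolution: scale every `stride`-th pixel."""
    return [[weight * v for v in row[::stride]] for row in image[::stride]]

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

# Original: full-resolution layer, then a 1x1 layer with stride 2.
original = conv1x1(conv1x1(image, 2, stride=1), 3, stride=2)

# Optimized: the stride is moved up, so the first layer does 4x less work.
optimized = conv1x1(conv1x1(image, 2, stride=2), 3, stride=1)

print(original, optimized)
```

In this toy case the first layer drops from 16 multiplications to 4 while the output is unchanged.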

Other optimizations include node merging, horizontal fusion, and dropping unused layers (dropout). For those interested in this topic, the documentation for TensorRT, the deployment tool used by Nvidia on GPUs, describes a larger list of optimizations used on computation graphs.

Intermediate representation Quantization

If requested by developers, further quantization can be applied to the weights to use lower precision arithmetic like INT8. This reduces memory bandwidth needs and speeds up operations.
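As an illustration of what weight quantization involves, here is a minimal sketch of symmetric per-tensor INT8 quantization. The scheme (one scale per tensor, range [-127, 127]) is an assumption chosen for simplicity and is not necessarily the one OpenVINO applies.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

Each weight now occupies one byte instead of four, and the multiplications can run on narrow integer multipliers, at the cost of a rounding error of at most half a quantization step per weight.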

Intel FPGA Deep Learning Acceleration (DLA) Suite

The Inference Engine in Deep Learning Deployment Toolkit (DLDT) provides a high-level device-agnostic API for developers to program the inference. Here is some sample code.

The Inference Engine loads the user-supplied IR and invokes the corresponding plugin to process the inference for the specific device.

For FPGA, it invokes the DLA (Deep Learning Acceleration) Runtime Engine, which drives the DL model execution in the accelerator.

Deploying a DNN model is a software process. The FPGA is already pre-programmed with a bitstream designed for DLA in running DL models. No FPGA compile is needed.

Here is the DLA architecture that the DLA Runtime uses to run a DL model. This architecture contains a convolution PE (processing element) array, caches for storing feature maps, and layers (components) commonly used in DL.

Let’s map a DNN model into this acceleration engine architecture. Many DL models, like AlexNet, contain groups of highly similar sequences of layers, for example, a convolution layer followed by ReLU, normalization, and max pooling.

Source (Groups of layers to be processed in the pipeline concept)

Inside an FPGA, a DL layer is implemented as blocks linked by the configured interconnects.

To run a group of layers, we create a data stream and pass it through the blocks responsible for the specific types of DL layer. To execute the whole model, we repeat the streaming cycle to process the next group until all DNN layers are processed.

These blocks are highly reconfigurable at runtime and can be bypassed. This lets us set different design parameters for a DL layer (like the CNN stride) or skip layers that are not needed.

Source (Process the 3rd and 2nd last layers)

Let’s elaborate further. First, video data arrives from the DDR (double data rate) channel.

If the video data is too large to be stored in the on-chip stream buffer, it is sliced and the slices are passed one by one through multiple iterations of the pipeline. In each iteration, data is pulled from the buffer and processed by the convolution PE array and the activation block. It then passes through other blocks, like normalization and max-pooling, via a crossbar (XBAR). The output is fed back to the stream buffer for the next group of layers. Once the whole model has been processed, the result is written back to memory and processing continues with the next slice of data. The following diagram summarizes the graph loop architecture used by the Deep Learning Accelerator (DLA) engine to execute the DL model.
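The control flow of this loop can also be written as a short Python sketch. It is a toy software model, not the DLA runtime: the stage names, the dictionary layout, and the arithmetic inside each block are all invented for illustration.

```python
def relu(x):
    return [max(0, v) for v in x]

def max_pool_pairs(x):
    """Toy 1-D max pooling over non-overlapping pairs."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]

def run_graph_loop(slices, layer_groups):
    """Process each input slice by looping over layer groups through one shared pipeline."""
    results = []
    for data in slices:                     # data too large for the chip is sliced
        stream_buffer = data                # load one slice from external memory
        for group in layer_groups:          # one pipeline pass per group of layers
            x = stream_buffer
            for stage in ("conv", "activation", "norm", "pool"):
                block = group.get(stage)    # a stage missing from the group is bypassed
                if block is not None:
                    x = block(x)
            stream_buffer = x               # feed back for the next group
        results.append(stream_buffer)       # write the result back to memory
    return results

layer_groups = [
    {"conv": lambda x: [2 * v - 3 for v in x], "activation": relu, "pool": max_pool_pairs},
    {"conv": lambda x: [v + 1 for v in x], "activation": relu},   # no pooling: bypassed
]
results = run_graph_loop([[1, 2, 3, 4], [0, 5, 1, 1]], layer_groups)
print(results)
```

The same physical blocks are reused for every group; only the per-group configuration and the bypass decisions change between passes.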

Inference Engine flow

Let’s also summarize the flow of the Inference Engine (IE) calls to an FPGA device. Developers make calls to the IE common APIs for inference. IE invokes the FPGA Plugin. This invokes the DLA (Intel Deep Learning Accelerator), which runs on the OpenCL runtime. The request is eventually sent to the DLA FPGA IP that implements the primitives, like convolution, ReLU, etc.

(Terms: FPGA IP refers to hardened Intellectual Property cores. In our context here, we can treat it as the physical blocks that implement the primitives.)

Bitstream

The Deep Learning Deployment Toolkit (DLDT) ships with many bitstreams for various boards, data types used in inference, and DL models.

OpenVINO ships with bitstreams that specialize in the following DL models using the graph loop architecture.

These are the supported primitives.

This execution model can be expanded when new types of DNN layers come out. The FPGA can be reprogrammed using a bitstream file with new kernels (primitives) attached to the XBAR. This is the appeal of FPGAs: we can change the hardware design without replacing the hardware.

Here is the architecture with a new primitive added.

To add such features/layers to DLA, the IP architect (the second row below) modifies the DLA suite source code and then recompiles an FPGA bitstream to be programmed onto the FPGA.

For your reference, the diagram below is the typical workflow to create a custom kernel (custom primitive) by the IP architect.

Design tradeoff

As shown before, different DL models (topologies) may lead to different choices of bitstream. For example, with fixed hardware resources, adding primitives may require shrinking the PE array.

Or it may steer your choice of FPGA product line toward a larger feature map cache over a larger PE array (details). Note: these tradeoffs are extremely sensitive to the DL models and applications, as some applications may need a bigger feature map cache to reduce off-chip memory access.

Optimization

The FPGA's flexibility in hardware design permits optimizations that cannot be done otherwise. In this section, we explore some of these optimizations for increasing concurrency and reducing operations. Without hardware assistance, these optimizations may not be possible or may not deliver the same order of performance gains.

Parallel Execution of Convolution

To increase the parallelism in processing the convolution layer, filters are divided and processed in parallel by different convolution PEs. In addition, the feature map values at the same spatial location can be multiplied with the filters using vector operations to increase concurrency.

Winograd Transformation

Winograd Transformation reduces the number of multiplications needed in convolution operations. To demonstrate the idea, let’s take an example in which we apply a 3 × 3 filter to a 4 × 4 image (stride equals 1 and no padding). To compute the convolution, we slide the filter window one pixel at a time and perform an element-wise multiplication with the underlying 3 × 3 image patch. The convolution result will be a 2 × 2 matrix. Traditionally, we perform four independent calculations (shown below), each with 9 multiplications. This results in a total of 36 multiplications.
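The count of 36 can be checked with a direct implementation. The sketch below is a plain sliding-window convolution (CNN style, without flipping the filter) with a multiplication counter; the image and filter values are arbitrary examples.

```python
def conv2d_valid(image, kernel):
    """Direct 2-D convolution, stride 1, no padding; also counts multiplications."""
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    output, multiplications = [], 0
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0
            for u in range(k):
                for v in range(k):
                    acc += image[i + u][j + v] * kernel[u][v]
                    multiplications += 1
            row.append(acc)
        output.append(row)
    return output, multiplications

image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]

output, multiplications = conv2d_valid(image, kernel)
print(output, multiplications)
```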

Let’s take a look at the efficiency of this method. Neighboring windows among those four independent calculations share ²/₃ of their image data. But if we compute them independently, we are not taking advantage of this significant overlap. Look at it from another angle: if the data overlapped 100%, we could perform one calculation instead of four independent calculations.

How to take advantage of this information does not seem obvious. But Winograd solved a similar problem more than 40 years ago. Let’s say we have two pre-process methods that transform the 4 × 4 data and the 3 × 3 filter into two 4 × 4 matrices separately (the first two rows in the left below).

Then we multiply these two transformed matrices element by element. Since both are 4 × 4, this takes only 16 multiplications. Finally, we perform a post-process transformation that converts the result back to a 2 × 2 matrix. It turns out that such pre-process and post-process transformations exist and produce the same convolution results. Better still, the data and output transformations involve only simple additions and subtractions, and the filter transformation can be computed once ahead of time. So the Winograd Transformation can save a lot of computation.

Let’s demonstrate the idea with 1-D data and filter.

The original convolution method for data ⊗ filter involves 6 multiplications. We merge the pre-processing, convolution, and post-processing steps here and show that the result can be derived by calculating m₁, m₂, m₃, and m₄ above. With this Winograd Transformation, we need only 4 multiplications.
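For readers who want to check the arithmetic, here is a small Python sketch of the 1-D case, assuming the commonly published F(2, 3) form of m₁ to m₄ (the figure is not reproduced here, so treat the exact grouping as an assumption). The two filter-side terms are computed once ahead of time, so only four data-dependent multiplications remain.

```python
def direct_1d(d, g):
    """Two outputs of a 3-tap filter over 4 samples: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_1d(d, g):
    """Same two outputs with 4 data-dependent multiplications."""
    # Filter transform: depends only on the filter, so it can be precomputed.
    g_plus = (g[0] + g[1] + g[2]) / 2
    g_minus = (g[0] - g[1] + g[2]) / 2
    # The four multiplications.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_plus
    m3 = (d[2] - d[1]) * g_minus
    m4 = (d[1] - d[3]) * g[2]
    # Output transform: additions and subtractions only.
    return [m1 + m2 + m3, m2 - m3 - m4]

data = [1, 2, 3, 4]
filt = [2, 4, 6]
print(direct_1d(data, filt), winograd_1d(data, filt))
```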

Just for completeness, here are the mathematical transformations needed. The pre-process transformation Gg, Bᵀ and the post-process transformation Aᵀ can be done with simple arithmetic operations.
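The 2-D case can be sketched the same way. The matrices below are the commonly published ones for producing a 2 × 2 output from a 3 × 3 filter; since the original figure is not reproduced here, take them as an assumption. The helper names are hypothetical.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

B_T = [[1, 0, -1, 0],
       [0, 1, 1, 0],
       [0, -1, 1, 0],
       [0, 1, 0, -1]]
G = [[1, 0, 0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0, 0, 1]]
A_T = [[1, 1, 1, 0],
       [0, 1, -1, -1]]

def winograd_2x2_3x3(tile, kernel):
    """Y = A^T [ (G g G^T) * (B^T d B) ] A, with * the element-wise product."""
    # Filter transform: computed once per filter, ahead of time.
    u = matmul(matmul(G, kernel), transpose(G))
    # Data transform: entries of B^T are 0 or +/-1, so hardware needs only adders.
    v = matmul(matmul(B_T, tile), transpose(B_T))
    # The only data-dependent multiplications: 16 of them.
    m = [[a * b for a, b in zip(row_u, row_v)] for row_u, row_v in zip(u, v)]
    # Output transform: again only additions and subtractions in hardware.
    return matmul(matmul(A_T, m), transpose(A_T))

tile = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
result = winograd_2x2_3x3(tile, kernel)
print(result)
```

This Python version spends multiplications inside `matmul` for convenience; the saving shows up in hardware, where the constant transforms reduce to wiring and adders.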

In FPGA, we can apply the Winograd transform in hardware to speed up the convolution.

Parallel Execution in Fully Connected Layer with a batch of data

In a fully-connected network, to increase the concurrency and to decrease the loading of the weights, we can process a batch of images (from multiple video channels) concurrently.
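A toy Python sketch of the weight-reuse idea follows: each row of weights is fetched once and applied to every image in the batch. The counter stands in for off-chip weight traffic; the function name and the numbers are illustrative only.

```python
def fc_batch(weights, batch):
    """Fully connected layer over a batch, reusing each weight row across images."""
    outputs = [[0] * len(weights) for _ in batch]
    weight_row_loads = 0
    for j, row in enumerate(weights):       # fetch one row of weights ...
        weight_row_loads += 1
        for n, x in enumerate(batch):       # ... and reuse it for every image
            outputs[n][j] = sum(w * v for w, v in zip(row, x))
    return outputs, weight_row_loads

weights = [[1, 2], [3, 4], [5, 6]]
batch = [[1, 1], [2, 0]]
outputs, weight_row_loads = fc_batch(weights, batch)
print(outputs, weight_row_loads)
```

Processing the two images one at a time would fetch the three weight rows twice (6 loads); batching fetches them once (3 loads), and the inner loop over images is what the hardware can run in parallel.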

Source (Right diagram: Handle a batch of images to increase concurrency and decrease the loading of the weights)

Feature Cache

In our streaming process, we use double buffers for the input and the result. For the next loop, we simply switch the use of these buffers (use the input buffer for output and vice versa). This avoids the need to save data into the off-chip memory.
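The buffer swap can be sketched as a ping-pong loop. This is a software analogy with made-up layer functions, not a description of the actual on-chip memory controller.

```python
def run_layers_ping_pong(layers, data):
    """Run layers back to back using two buffers that swap roles each loop."""
    buffers = [list(data), None]    # two on-chip buffers
    src = 0
    for layer in layers:
        dst = 1 - src
        buffers[dst] = layer(buffers[src])   # read from one buffer, write to the other
        src = dst                            # the output becomes the next input
    return buffers[src]

layers = [
    lambda x: [v + 1 for v in x],
    lambda x: [v * 2 for v in x],
    lambda x: [v - 3 for v in x],
]
result = run_layers_ping_pong(layers, [1, 2, 3])
print(result)
```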

Filter cache

We can also use a double buffer in which one stores the weights for the current convolution while the other is for pre-fetching for the next convolution to increase concurrency.

Lower Precision

As a general trend in AI hardware design, vendors are exploring lower-precision data types that cover the same range for inference. For example, FP11 below has the same range as FP16 but lower precision because of its smaller mantissa. The data type used for inference in an FPGA is configurable, and an FPGA provides a lot of flexibility in creating arithmetic circuits of different data sizes.
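The effect of a shorter mantissa can be simulated in Python. The sketch assumes FP11 keeps FP16's sign bit and 5 exponent bits and shortens the mantissa from 10 to 5 bits, and it truncates rather than rounds; both are assumptions made for illustration, since the figure with the exact layout is not reproduced here.

```python
import struct

def to_fp11(value, mantissa_bits=5):
    """Convert to FP16, then zero the low mantissa bits to mimic a shorter mantissa."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))   # FP16 bit pattern
    drop = 10 - mantissa_bits                                 # FP16 has 10 mantissa bits
    bits &= ~((1 << drop) - 1) & 0xFFFF                       # exponent and sign untouched
    (result,) = struct.unpack("<e", struct.pack("<H", bits))
    return result

print(to_fp11(1.0))        # exactly representable, unchanged
print(to_fp11(3.14159))    # coarser than FP16's 3.140625
print(to_fp11(65504.0))    # largest FP16 value: same exponent, fewer mantissa bits
```

Because the exponent field is untouched, large and small magnitudes remain representable; only the spacing between neighboring values grows.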

Intel Stratix 10 NX FPGA (For AI inferencing)

The Intel Stratix 10 NX FPGA is designed specifically for AI, with AI Tensor Blocks. These blocks contain dense arrays of lower-precision multipliers tuned for matrix and vector multiplications with INT4, INT8, Block FP12, or Block FP16 operations. In addition, these tensor blocks can be cascaded together to support large matrices.

The AI Tensor Block contains 30 multipliers and 30 accumulators instead of the two in the DSP block. This FPGA also includes integrated HBM2 memory and high-speed transceivers.

Intel Agilex FPGA

Agilex is a new FPGA line from Intel. It introduces BF16, which has become popular in many AI-centric chips.

Here are some high-level specifications.

Xilinx Versal

Xilinx Versal is an Adaptive Compute Acceleration Platform (ACAP). ACAP is a heterogeneous compute platform that combines Scalar Engines, Adaptable Engines (a.k.a. CLB), and AI Engines. All these engines are interconnected with a network-on-chip (NoC) to achieve multi-terabit communication.

The AI Engines contain an array of VLIW/SIMD vector cores with tightly coupled local memory. Like FPGAs, they are highly configurable for specialized hardware designs, and they are targeted at DL inference.

Versal Details

Here is another view of the logic blocks in Versal.

It contains AI engines that are composed of an array of SIMD cores. Here is the detailed view of a single tile in the array.

Vitis AI provides a software platform to optimize models from common DL platforms and deploy them onto Xilinx devices. It performs model pruning, quantization, model optimization, and instruction generation. It also provides an AI profiler for monitoring the efficiency and utilization of an AI inference implementation.