Image for post

A CPU provides a generic set of instructions for general-purpose computing. To modify or optimize an application, we change the code while the hardware stays fixed. However, this generality comes at the cost of hardware complexity. Without complex hardware optimizations such as speculative execution, performance suffers, yet these optimizations increase die size and power consumption.

Image for post
Generality provides flexibility at the cost of complexity

To increase concurrency in Deep Learning (DL), some chip designers limit the chip functionality to a vertical set of instructions and implement it as an ASIC (application-specific integrated circuit). That is the approach used by the Google TPU. However, developing an ASIC is expensive and impractical if the design requirements change constantly.

An FPGA provides a middle ground between generic processors (like CPUs) and ASICs. Designers can program an FPGA chip with their own hardware design, and that design can later be changed or enhanced easily by reprogramming the FPGA. Let’s start with a technical overview for those not familiar with FPGAs.

What is FPGA? (Optional)

For decades, the semiconductor industry has had a middle-ground solution that allows configurable hardware design, sitting between generic components and customized ASIC designs: the FPGA (Field-Programmable Gate Array). It is called “field-programmable” because we can easily reprogram the chip for different hardware designs. We can think of an FPGA as a box of Lego blocks. By putting the blocks together differently (programming), we create different toys (hardware designs). Let’s look at a DL example to explain the high-level application in AI.

Image for post

Many DL models, like the fully-connected deep neural network (DNN) above, can be treated as computation graphs. The nodes represent computations and the edges indicate the data flow. To model this graph, our hardware design should be composed of computation blocks that mimic the calculations in these nodes, and then we link the data flow from one block to another.

FPGAs are semiconductor devices that are based around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects. CLBs are highly configurable to create different logic. With Programmable Interconnect, we can create a complex data path for those CLBs.

Image for post

CLB

The diagram below is the high-level block diagram of the adaptive logic modules (ALMs) contained in a logic array block (LAB) of the Intel Stratix 10. The LAB is also known as a configurable logic block (CLB).

Image for post
Source

A CLB uses configurable look-up tables (LUTs) (on the left above) to implement a logical function f(a, b, c, …). We can configure a LUT to mimic any logical function of its inputs.
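
To make the LUT idea concrete, here is a minimal software sketch in Python. It is only an analogy, not vendor tooling, and the helper names `make_lut` and `lut_read` are hypothetical. Configuring a LUT amounts to storing a truth table, and evaluating it is a single table read indexed by the input bits.

```python
def make_lut(func, num_inputs):
    """'Configure' a LUT: precompute the truth table of a boolean function."""
    table = []
    for index in range(2 ** num_inputs):
        # Bit i of the table index is the value of input i.
        bits = [(index >> i) & 1 for i in range(num_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_read(table, *bits):
    """Evaluate the LUT: the input bits simply select one table entry."""
    index = sum(bit << i for i, bit in enumerate(bits))
    return table[index]


# The same 3-input structure holds two different functions,
# depending only on the table contents.
xor3 = make_lut(lambda a, b, c: a ^ b ^ c, 3)
majority = make_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3)

print(lut_read(xor3, 1, 0, 1), lut_read(majority, 1, 1, 0))
```

Loading a different table into the same structure yields a different function, which is what makes the block configurable.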

Image for post

Inside the CLB, the LUTs are often followed by adders with registered outputs (the outputs hold their previous values until the next clock edge).

For illustration, this is another CLB example from the Xilinx 7 series FPGA. In the right diagram, the input side has two modules which take four inputs each and output a logical value. Then, further combinational logic is applied to create logical functions that support more than four inputs with multiple outputs.

Image for post
Source

FPGA in DL

To further enhance the capability, other blocks like memory blocks, multipliers, embedded processors, and DSP blocks can be added to an FPGA. These blocks can be segmented and connected through the vertical and horizontal lines below.

Image for post

Here is the Intel Stratix 10 Variable Precision DSP Block. A DNN design implements many of its arithmetic functions with these DSP blocks.

Image for post
Source

The DSP blocks are also configurable to support multiple features.

Image for post
Source (Intel)

FPGA selling points

There are many other blocks that FPGA may offer depending on vendors and product lines.

Image for post
Source

An FPGA may contain millions of logic elements, thousands of memory blocks, and thousands of DSP blocks. These memory blocks can offer >50 TB/s of on-chip SRAM bandwidth.

Image for post
Source Intel Stratix product line

This array of blocks can be grouped, segmented, and connected with programmable connections. These blocks and interconnects are highly configurable to create the parallelism and the computation power for highly customized designs. High-speed memory blocks can also be grouped into different sizes and serve specific size/cache needs for a specific set of nodes and features.

Image for post
Source

In addition, an FPGA is designed to receive and transmit signals fast. The emphasis on high-speed transceivers (with many I/O blocks) is another selling point for handling visual and audio streaming data. An FPGA can be reprogrammed in the 20 ms range with a new FPGA bitstream file (subject to the FPGA model). Product upgrades are no longer restricted to software upgrades: hardware design upgrades can be delivered with a new bitstream file. A customized design also usually consumes less power, typically 38 to 42 W in the Intel Vision Accelerator Design with Arria 10 FPGA versus the 250 W range of the Nvidia V100 GPU (this is just for illustration, as the two are very different devices). The system latency can also be shortened with customized hardware. This is probably why Intel has pitched its FPGAs for AI inference.

Programmable Interconnects

These FPGA blocks can be connected together through vertical and horizontal lines.

Image for post
Source: Wikipedia

The connections themselves are made through highly configurable Programmable Switch Matrices (PSMs) that we can preprogram.

Image for post
Source (Intel)

This is a zoomed view of possible interconnections for a small segment of the FPGA chip.

Image for post
Source (Xilinx)

This is another demonstration of how blocks may be connected.

Image for post
Source

Intel Vision Accelerator Design with Arria 10 FPGA (For AI inferencing)

Image for post
Source: Intel

The Intel Vision Accelerator Design with Arria 10 FPGA can connect to a video server or a network video recorder (NVR) to support more than 20 channels of video input for analysis, processing, or other AI applications, including facial recognition and detection.

Image for post

For example, we can use the accelerator to analyze raw video streams for person detection and tracking from many video sources (potential use in stores like Amazon Go).

Image for post
Source: Intel

Here is the design for another FPGA acceleration card (Programmable Acceleration Card) using Arria 10 FPGA.

Image for post
Source

These accelerator cards can perform many AI functions. OpenVINO (discussed later) provides many AI models that perform the following functions:

Image for post
Source

Here is the high-level specification for the accelerator card and the Arria 10 if you are interested.

Image for post
Source

For decades, FPGAs have been programmed in a hardware description language (HDL). It speaks the language of registers, clocks, etc…

Image for post
Source Intel (Programming FPGA with HDL)

With the tools provided by the vendors, the HDL source code is transformed into circuits. A place-and-route algorithm then maps them onto blocks and interconnections while minimizing latency. This compilation process can take hours or even days to finish. Once it is done, a bitstream file is created. This file is used to program (configure) the FPGA.

However, this is not how we design or deploy DNN models on Intel FPGAs. Toolkits perform a software deployment instead of reprogramming the FPGA, which should already be programmed.

OpenVINO

OpenVINO (Open Visual Inference and Neural network Optimization) provides toolkits and libraries for Intel devices (CPU, integrated GPU, FPGA, etc…) in the following areas.

Image for post
Source

It can be used to deploy a DL model to analyze a video stream using the FPGA.

Image for post
Source: Intel

Intel’s distribution of OpenVINO contains:

  • Deep Learning Deployment Toolkit (DLDT) for optimizing and deploying ML models,
  • FPGA Deep Learning Acceleration (DLA) and bitstreams for DL,
  • Tools, libraries, and accelerators for OpenCV, OpenVX, OpenCL and Intel Media SDK,
  • Plugins for Intel devices, and
  • A model zoo with pre-trained models for DLDT.
Image for post
Source

DL engineers design and train models in software platforms like TensorFlow, PyTorch, etc… Ultimately, we save trained models in files (.pb in TensorFlow).

Image for post
Source

Then, we use the Deep Learning Deployment Toolkit (DLDT) in OpenVINO to deploy the DL model. DLDT contains two major components:

  • Model optimizer and
  • Inference engine.

Specifically, DLDT reads the model files, performs optimizations, and deploys the model on Intel devices. To offer a single-platform solution for all Intel devices, it reads the trained model file from a popular DL platform (like TensorFlow), applies device-agnostic optimizations, and converts the model into a device-agnostic intermediate representation (IR).

The following two sections are optional and demonstrate some computation graph optimizations.

Linear Operations Fusing (advanced topic, optional)

Many CNN models, including ResNet and Inception, contain batch normalization and scale-shift layers (which scale and then offset the input) that can be viewed as sequences of multiplications and additions. These sequences can be merged into a single multiply-and-add operation, i.e. ×, +, ×, +, ×, + → ×, +. The result can then be fused into the next convolution layer or fully connected layer if present (details).

Image for post
ResNet269 block before and after optimization
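
The merging step can be sketched in a few lines of Python. This is an illustrative toy with hypothetical helper names (`fuse_scale_shift`, `fuse_into_next_fc`, `fc`), not the Model Optimizer's actual code. A chain of scale-and-shift steps collapses into one (scale, shift) pair, which can then be folded into the weights and bias of a following fully connected layer.

```python
def fuse_scale_shift(ops):
    """Collapse x -> s1*x + b1 -> s2*(...) + b2 -> ... into one (scale, shift)."""
    total_scale, total_shift = 1.0, 0.0
    for scale, shift in ops:
        total_scale = scale * total_scale
        total_shift = scale * total_shift + shift
    return total_scale, total_shift


def fuse_into_next_fc(weights, bias, scale, shift):
    """Fold a per-input scale/shift into the following layer y = W z + c.

    With z[j] = scale[j] * x[j] + shift[j], we get y = W' x + c' where
    W'[i][j] = W[i][j] * scale[j] and c'[i] = c[i] + sum_j W[i][j] * shift[j].
    """
    fused_w = [[w * s for w, s in zip(row, scale)] for row in weights]
    fused_b = [c + sum(w * t for w, t in zip(row, shift))
               for row, c in zip(weights, bias)]
    return fused_w, fused_b


def fc(weights, bias, x):
    """Plain fully connected layer: y = W x + c."""
    return [sum(w * v for w, v in zip(row, x)) + c
            for row, c in zip(weights, bias)]


# Three scale/shift steps become a single multiply-add.
chain = [(2.0, 1.0), (0.5, -3.0), (4.0, 2.0)]
fused_scale, fused_shift = fuse_scale_shift(chain)

# A per-channel scale/shift folded into the next layer's weights and bias.
W = [[1.0, 2.0], [3.0, 4.0]]
c = [0.5, -0.5]
fused_W, fused_c = fuse_into_next_fc(W, c, [2.0, 3.0], [1.0, -1.0])
```

After fusing, the scale-shift layer disappears from the graph entirely: the next layer does the same arithmetic with adjusted constants.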

Stride optimization in ResNet (advanced topic, optional)

The main idea here is to move a stride greater than 1 from later convolution layers (with kernel size = 1) to earlier convolution layers. In addition, a pooling layer is added for the skip connection.

Image for post
Stride optimization

Other optimizations include node merging, horizontal fusion, and dropping layers that are unused in inference (such as dropout). TensorRT is the deployment tool Nvidia uses for GPUs. For those interested in this topic, its documentation lists a larger set of optimizations applied to computation graphs.

Intermediate representation Quantization

If requested by developers, further quantization can be applied to the weights to use lower precision arithmetic like INT8. This reduces memory bandwidth needs and speeds up operations.
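
As a rough illustration of what weight quantization does, the sketch below applies symmetric per-tensor INT8 quantization in plain Python. The scheme and the function names are assumptions chosen for clarity; the text above does not say which quantization scheme the toolkit actually uses.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] with one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.27]
q_weights, q_scale = quantize_int8(weights)
restored = dequantize(q_weights, q_scale)
print(q_weights, restored)
```

Each weight now needs one byte instead of four (for FP32), at the price of a rounding error of at most half a quantization step per weight.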

Intel FPGA Deep Learning Acceleration (DLA) Suite

Image for post
Source

The Inference Engine in Deep Learning Deployment Toolkit (DLDT) provides a high-level device-agnostic API for developers to program the inference. Here is some sample code.

Image for post
Source

The Inference Engine loads the user-supplied IR and invokes the corresponding plugin to process the inference for the specific device.

Image for post
Source

For FPGA, it invokes the DLA (Deep Learning Acceleration) Runtime Engine.

Image for post
Source

which drives the DL model execution in the accelerator.

Image for post
Source

Deploying a DNN model is a software process. The FPGA is already pre-programmed with a bitstream designed for DLA in running DL models. No FPGA compile is needed.

Image for post
Source

Here is the DLA architecture that the DLA Runtime uses to run a DL model. This architecture contains a convolution PE (processing element) array, caches for storing feature maps, and blocks for the layers (components) commonly used in DL.

Image for post
Source

Let’s map a DNN model into this acceleration engine architecture. Many DL models, like AlexNet, contain groups of highly similar sequences of layers, for example, a convolution layer followed by ReLU, normalization, and max pooling.

Image for post
Source (Groups of layers to be processed in the pipeline concept)

Inside an FPGA, a DL layer is implemented as blocks linked by the configured interconnects.

Image for post
Source

To run a group of layers, we create a data stream and pass it through the blocks responsible for the specific types of DL layer. To execute the whole model, we repeat the streaming cycle to process the next group until all DNN layers are processed.

Image for post
Source

These blocks are highly reconfigurable at runtime and can be bypassed. This lets us change the design parameters of a DL layer (like the CNN stride) or skip layers that are not needed.

Image for post
Source (Process the 3rd and 2nd last layers)
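
The streaming loop with bypassable blocks can be mimicked in software. The sketch below is a toy 1-D model in Python with hypothetical block and function names; it is not the DLA runtime. Each group of layers is described by a configuration dictionary: a block without an entry is bypassed, and parameters such as the stride are set per group.

```python
def conv_block(data, kernel, stride=1):
    """1-D 'valid' convolution (correlation form) with a configurable stride."""
    span = len(data) - len(kernel) + 1
    return [sum(k * data[i + j] for j, k in enumerate(kernel))
            for i in range(0, span, stride)]


def relu_block(data):
    return [max(0, v) for v in data]


def max_pool_block(data, window=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(data[i:i + window])
            for i in range(0, len(data) - window + 1, window)]


# Fixed order of the blocks in the pipeline.
PIPELINE = [("conv", conv_block), ("relu", relu_block), ("pool", max_pool_block)]


def run_group(stream_buffer, config):
    data = stream_buffer
    for name, block in PIPELINE:
        if name not in config:          # this block is bypassed for the group
            continue
        data = block(data, **config[name])
    return data


def run_model(input_data, groups):
    buffer = list(input_data)
    for config in groups:               # one pass through the pipeline per group
        buffer = run_group(buffer, config)  # the result is fed back to the buffer
    return buffer


groups = [
    {"conv": {"kernel": [1, -1]}, "relu": {}, "pool": {"window": 2}},
    {"conv": {"kernel": [1, 1]}},       # ReLU and pooling are bypassed here
]
result = run_model([3, 1, 4, 1, 5, 9, 2, 6], groups)
print(result)
```

The same fixed set of blocks serves every group of layers; only the per-group configuration changes between passes.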

Let’s elaborate further. First, video data arrives from the DDR (double data rate) channel.

Image for post
Source

If the video data is too large to be stored in the on-chip stream buffer, it is sliced and the slices are passed one by one through multiple iterations of the pipeline. In each iteration, data is pulled from the buffer and processed by the convolution PE array (PE: processing element) and the activation block. It then passes through other blocks, like normalization and max-pooling, via a crossbar (XBAR). The data is then fed back to the stream buffer for the next group of layers. Once the whole model is processed, the result is written back to memory and processing continues with the next slice of data. The following diagram summarizes the graph loop architecture used by the Deep Learning Accelerator (DLA) engine to execute the DL model.

Image for post
(Source)

Inference Engine flow

Let’s also summarize the flow of Inference Engine (IE) calls to an FPGA device. Developers call the IE common APIs for inference. IE invokes the FPGA plugin. This invokes the DLA (Intel Deep Learning Accelerator), which runs the OpenCL runtime. The request is eventually sent to the DLA FPGA IP that implements the primitives, like convolution, ReLU, etc…

(Terms: FPGA IP refers to hardened Intellectual Property cores. In our context here, we can treat it as the physical blocks that implement the primitives.)

Image for post
Source

Bitstream

The Deep Learning Deployment Toolkit (DLDT) ships with many bitstreams for various boards, data types used in inference, and DL models.

Image for post
Naming conventions for the bitstream file

OpenVINO ships with bitstreams that specialize in the following DL models using the graph loop architecture.

Image for post
Source

These are the supported primitives.

Image for post
Source

This execution model can be expanded when new types of DNN layers come out. The FPGA can be reprogrammed using a bitstream file with new kernels (primitives) attached to the XBAR. This is the appeal of FPGAs: we can change the hardware design without replacing the hardware.

Image for post
Source

Here is the architecture with a new primitive added.

Image for post

To add such features/layers to the DLA, the IP architect (the second row below) modifies the DLA suite source code and then recompiles an FPGA bitstream to be programmed onto the FPGA.

Image for post
Source

For your reference, the diagram below shows the typical workflow the IP architect follows to create a custom kernel (custom primitive).

Image for post
Source

Design tradeoff

As shown before, different DL models (topologies) may lead to different choices of bitstream. For example, with fixed hardware resources, adding primitives may require shrinking the PE array.

Image for post
Source

Or it may influence your choice of FPGA product line, favoring a larger feature map cache over a larger PE array (details). Note: these tradeoffs are extremely sensitive to the DL models and applications, as some applications may need a bigger feature map cache to reduce off-chip memory access.

Image for post
Source

Optimization

FPGA flexibility in hardware design permits optimizations that cannot be done otherwise. In this section, we explore some of these optimizations for increasing concurrency and reducing operations. Without hardware assistance, these optimizations may not be possible or may not deliver the same order of performance gains.

Parallel Execution of Convolution

To increase the parallelism in processing the convolution layer, filters are divided and processed in parallel by different convolution PEs. In addition, the feature map values at the same spatial location can be multiplied with the filters using vector operations to increase concurrency.

Image for post
Source

Winograd Transformation

The Winograd Transformation reduces the number of multiplications needed in convolution operations. To demonstrate the idea, let’s take an example in which we apply a 3 × 3 filter to a 4 × 4 image (stride 1, no padding). To compute the convolution, we slide the filter window one pixel at a time and perform an element-wise multiplication with the underlying 3 × 3 image patch. The convolution result is a 2 × 2 matrix. Traditionally, we perform four independent calculations (shown below), each with 9 multiplications. This results in a total of 36 multiplications.

Image for post
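
The multiplication count is easy to verify with a direct implementation. The following Python sketch (the function name and the sample values are illustrative assumptions) computes the convolution the straightforward way and counts the multiplications as it goes.

```python
def conv2d_direct(image, kernel):
    """Direct 'valid' convolution (stride 1, no padding), counting multiplications.

    As is usual in DL, the kernel is not flipped (cross-correlation form).
    """
    n, k = len(image), len(kernel)
    out_size = n - k + 1
    multiplications = 0
    output = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            acc = 0
            for i in range(k):
                for j in range(k):
                    acc += image[r + i][c + j] * kernel[i][j]
                    multiplications += 1
            row.append(acc)
        output.append(row)
    return output, multiplications


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
kernel = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]
output, multiplications = conv2d_direct(image, kernel)
print(output, multiplications)
```

Four output values times nine multiplications each gives the 36 multiplications mentioned above.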

Let’s take a look at the efficiency of this method. Those four independent calculations contain ²/₃ overlapping image data. But if we compute them independently, we take no advantage of this significant overlap. Look at it from another angle: if the data overlapped 100%, we could perform one calculation instead of four independent ones.

How to take advantage of this overlap is not obvious. But Winograd solved a similar problem more than 40 years ago. Let’s say we have two pre-processing methods that transform the 4 × 4 data and the 3 × 3 filter separately into two 4 × 4 matrices (the first two rows on the left below).

Image for post

Then we multiply these two transformed matrices element by element. Since both are 4 × 4, this takes only 16 multiplications. Finally, we perform a post-processing transformation that converts the result back to a 2 × 2 matrix. It turns out that such pre-processing and post-processing transformations exist and produce the same convolution result. The better news is that these transformations involve only simple add and subtract operations, with no multiplication. So the Winograd Transformation can save a lot of computation.

Let’s demonstrate the idea with 1-D data and filter.

Image for post
Source

The original convolution method for data ⊗ filter involves 6 multiplications. We will merge the pre-processing, convolution, and post-processing steps here and show that the result can be derived from calculating m₁, m₂, m₃, and m₄ above. With this Winograd Transformation, we need only 4 multiplications.

Just for completeness, here are the mathematical transformations needed. The pre-processing transformations (Gg for the filter and Bᵀ for the data) and the post-processing transformation Aᵀ can be done with simple arithmetic operations.
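
The 1-D case can be checked numerically. Because the figure's exact definitions are not reproduced in the text, the sketch below assumes the standard Winograd F(2, 3) formulation (two outputs of a 3-tap filter); the intermediate names m1 to m4 follow that convention and may be arranged differently from the figure, and the function names are illustrative. The filter-side sums involve a division by 2, but they depend only on the filter and can be computed once ahead of time, so only four multiplications involve the data.

```python
def direct_conv_1d(d, g):
    """Two outputs of a 3-tap filter over 4 data points: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs with 4 data multiplications (standard F(2, 3))."""
    # Filter-side terms: precomputable once per filter.
    g_plus = (g[0] + g[1] + g[2]) / 2
    g_minus = (g[0] - g[1] + g[2]) / 2

    # Data-side pre-processing uses only additions and subtractions.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_plus
    m3 = (d[2] - d[1]) * g_minus
    m4 = (d[1] - d[3]) * g[2]

    # Post-processing also uses only additions and subtractions.
    return [m1 + m2 + m3, m2 - m3 - m4]


data = [1, 2, 3, 4]
taps = [2, 4, 6]
print(direct_conv_1d(data, taps), winograd_f23(data, taps))
```

Both functions return the same two values; the Winograd version trades two multiplications for a handful of extra additions, which are much cheaper in hardware.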

Image for post
Source

In FPGA, we can apply the Winograd transform in hardware to speed up the convolution.

Image for post

Parallel Execution in Fully Connected Layer with a batch of data

In a fully-connected network, to increase the concurrency and to decrease the loading of the weights, we can process a batch of images (from multiple video channels) concurrently.

Image for post
Source (Right diagram: Handle a batch of images to increase concurrency and decrease the loading of the weights)
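
The saving from batching can be illustrated by counting how often a row of weights has to be fetched. This Python sketch uses hypothetical function names and treats one "load" as fetching one weight row; it models the idea only, not the FPGA data path.

```python
def fc_per_image(images, weights):
    """Process images one at a time: every weight row is fetched per image."""
    loads = 0
    outputs = []
    for image in images:
        out = []
        for row in weights:
            loads += 1
            out.append(sum(w * x for w, x in zip(row, image)))
        outputs.append(out)
    return outputs, loads


def fc_batched(images, weights):
    """Process a batch: each weight row is fetched once and reused for all images."""
    loads = 0
    outputs = [[] for _ in images]
    for row in weights:
        loads += 1
        for out, image in zip(outputs, images):
            out.append(sum(w * x for w, x in zip(row, image)))
    return outputs, loads


weights = [[1, 0], [0, 1], [1, 1]]
images = [[1, 2], [3, 4]]
single_out, single_loads = fc_per_image(images, weights)
batch_out, batch_loads = fc_batched(images, weights)
print(single_loads, batch_loads)
```

The outputs are identical, but the batched version fetches each weight row once per batch instead of once per image, and the inner loop over images is independent work that hardware can run in parallel.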

Feature Cache

In our streaming process, we use double buffers for the input and the result. For the next loop, we simply switch the use of these buffers (use the input buffer for output and vice versa). This avoids the need to save data into the off-chip memory.

Image for post
Source
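
A minimal software analogy of this buffer swapping, with hypothetical names, looks like this in Python. Two buffers exchange roles after every layer, so intermediate results never leave the pair of buffers.

```python
def run_with_double_buffer(input_data, layers):
    buffers = [list(input_data), []]   # stand-ins for the two on-chip buffers
    src = 0                            # index of the buffer currently being read
    for layer in layers:
        dst = 1 - src
        buffers[dst] = layer(buffers[src])  # read one buffer, write the other
        src = dst                           # swap roles for the next layer
    return buffers[src]


layers = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
]
output = run_with_double_buffer([1, 2, 3], layers)
print(output)
```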

Filter cache

We can also use a double buffer in which one stores the weights for the current convolution while the other is for pre-fetching for the next convolution to increase concurrency.

Image for post
Source

Lower Precision

As a general trend in AI hardware design, vendors are exploring low-precision data types that cover the same range for inference. For example, FP11 below has the same range as FP16 but lower precision because of its smaller mantissa. The data type used for inference in an FPGA is configurable, and an FPGA provides a lot of flexibility in creating arithmetic circuits of different data sizes.

Image for post
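
To see the effect of a shorter mantissa, the Python sketch below takes a 16-bit float and clears its low mantissa bits. It assumes FP11 keeps FP16's sign bit and 5 exponent bits and shortens the 10-bit mantissa to 5 bits; that layout is an assumption for illustration, as the text only states that the mantissa is smaller. Truncation is used instead of rounding to keep the example short, and the function names are hypothetical.

```python
import struct


def fp16_bits(value):
    """Return the 16-bit pattern of a value rounded to IEEE half precision."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def truncate_mantissa(value, kept_bits):
    """Keep only the top `kept_bits` of FP16's 10 mantissa bits."""
    dropped = 10 - kept_bits
    bits = fp16_bits(value)
    bits &= ~((1 << dropped) - 1) & 0xFFFF   # clear the low mantissa bits
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_fp11(value):
    """Assumed FP11 layout: 1 sign bit, 5 exponent bits, 5 mantissa bits."""
    return truncate_mantissa(value, kept_bits=5)


for sample in (1.0, 3.14, 65504.0):
    print(sample, "->", to_fp11(sample))
```

The exponent field is untouched, so large and small magnitudes remain representable; only the spacing between neighboring values grows.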

Intel Stratix 10 NX FPGA (For AI inferencing)

Image for post
Source

Intel Stratix 10 NX FPGAs are designed specifically for AI, with AI Tensor Blocks. These blocks contain dense arrays of lower-precision multipliers tuned for matrix and vector multiplications with INT4, INT8, Block FP12, or Block FP16 operations. In addition, these tensor blocks can be cascaded together to support large matrices.

Image for post
Source (Tensor block)

The AI Tensor Block contains 30 multipliers and 30 accumulators instead of the two in the DSP block. This FPGA also includes integrated HBM2 memory and high-speed transceivers.

Intel Agilex FPGA

Agilex is a new FPGA line from Intel. It introduces support for BF16, which has become popular in many AI-centric chips.

Image for post
Source: Google

Here are some high-level specifications.

Image for post
Source

Xilinx Versal

Xilinx Versal is an Adaptive Compute Acceleration Platform (ACAP). An ACAP is a heterogeneous compute platform that combines Scalar Engines, Adaptable Engines (a.k.a. CLB), and AI Engines. All these engines are interconnected by a network-on-chip (NoC) to achieve multi-terabit communication.

Image for post
Source

The AI Engines contain an array of VLIW/SIMD vector cores with tightly coupled local memory. Like FPGAs, they are highly configurable for specialized hardware designs, and they are targeted at DL inference.

Image for post
Xilinx Versal

Versal Details

Here is another view of the logic blocks in Versal.

Image for post
Xilinx Versal ACAP Functional Diagram

It contains AI engines that are composed of an array of SIMD cores. Here is the detailed view of a single tile in the array.

Image for post
AI Engine Tile

Vitis AI provides a software platform to deploy and optimize models from common DL platforms onto Xilinx devices. It performs model pruning, quantization, model optimization, and instruction generation. It also provides an AI profiler for monitoring the efficiency and utilization of an AI inference implementation.

Image for post
Vitis AI

For simplicity, we will not discuss the Xilinx devices further and we encourage readers to visit the Xilinx website for more details.

