AI Edge Chips: Nvidia Jetson Xavier NX, AGX Xavier, Google Coral Edge TPU & Startups
Deep Learning (DL) inference is often done in the cloud to take advantage of widely available, flexible infrastructure. Nevertheless, as AI is gradually embraced by embedded systems, more AI edge chips will be adopted where power consumption, latency, or connectivity are dominant design factors. While general AI chips emphasize instruction and data throughput, edge chips need to be small, low-power (many in the range of a few watts up to 20-30W), and low-latency.
Edge Device Applications
The early adopters of AI edge chips in the consumer market will likely be computer vision applications: object detection and classification, pose estimation, gaze detection, and image segmentation. NLP applications that run without the cloud are catching up, such as always-on voice processors for simple command processing.
Other high-end applications include autonomous machines such as robots. Edge chips are also popular in video analytics, such as traffic management at the edge, which reduces the massive bandwidth load on cloud servers.
Nvidia Jetson AGX Xavier
Jetson AGX Xavier is the high-end embedded system-on-module (SoM) that Nvidia offers, with potential applications including autonomous machines.
In this article, we present high-level specifications and designs for different chips. Many of these are for reference and quick comparison, so please feel free to browse through them quickly according to your interest level.
The module is built around an integrated Xavier SoC (System-on-Chip, a complete computer system on a single chip):
The SoC mainly contains
- 8 core Carmel ARM 64-bit CPU,
- 512 core Volta GPU with 64 Tensor Cores,
- dual Deep Learning Accelerator (DLA),
- two Vision Accelerator (VA) engines,
- HD video codecs,
- PCIe Gen 4, and
- 16 camera lanes of MIPI CSI-2 (128Gbps) — up to six cameras.
It has a memory bandwidth of 137GB/s with 16GB of 256-bit LPDDR4x memory (LPDDR: low-power double data rate synchronous DRAM), 32GB of eMMC storage (embedded storage built from NAND flash memory), and 750Gbps of high-speed I/O. The dual Deep Learning Accelerator (DLA) engines offload the inferencing of Deep Neural Networks (DNNs) from the main processors. The system runs Linux and delivers up to 32 TOPS in configurable 10/15/30W power profiles. So let's look at each component a little bit more.
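As a quick sanity check (my own back-of-the-envelope arithmetic, not an Nvidia figure), the quoted bandwidth and bus width imply a per-pin data rate consistent with an LPDDR4x-4266-class part, and the 30W profile works out to roughly 1 TOPS per watt:

```python
# Rough cross-check of the quoted Xavier figures; all input numbers
# are taken from the spec above, the derivations are my own.

BUS_WIDTH_BITS = 256      # 256-bit LPDDR4x interface
BANDWIDTH_GB_S = 137      # quoted memory bandwidth

# Per-pin data rate implied by total bandwidth / bus width.
data_rate_mt_s = BANDWIDTH_GB_S * 1e9 * 8 / BUS_WIDTH_BITS / 1e6
print(f"~{data_rate_mt_s:.0f} MT/s per pin")   # ~4281, i.e. LPDDR4x-4266 class

# Peak efficiency assuming 32 TOPS is reached in the 30W profile.
tops_per_watt = 32 / 30
print(f"~{tops_per_watt:.2f} TOPS/W")
```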
The CPU consists of four dual-core Carmel CPU clusters based on ARMv8.2. Each core includes 128KB instruction and 64KB data L1 caches and a 2MB L2 cache shared between the two cores. The CPU clusters share a 4MB L3 cache.
The integrated Volta GPU has 512 CUDA cores and 64 Tensor Cores with a 128KB L1 cache, and a 512KB L2 cache.
Deep Learning Accelerator (DLA)
According to Nvidia,
The majority of compute effort for Deep Learning inference is based on mathematical operations that can mostly be grouped into four parts: convolutions; activations; pooling; and normalization. These operations share a few characteristics that make them particularly well suited for special-purpose hardware implementation: their memory access patterns are extremely predictable, and they are readily parallelized. The NVIDIA Deep Learning Accelerator (NVDLA) project promotes a standardized, open architecture to address the computational demands of inference.
NVDLA hardware comprises the following separate and independently configurable components:
- Convolution Core for convolution layers.
- Single Data Processor — activation function lookup engine.
- Planar Data Processor — for pooling.
- Channel Data Processor — multi-channel averaging engine for advanced normalization functions.
- Dedicated Memory and Data Reshape Engines — memory-to-memory transformation for tensor reshape and copy.
The CPU is responsible for scheduling tasks for NVDLA. In the “small” NVDLA system (the left diagram), it will typically execute one task at a time.
In the “large” system, fine-grained scheduling and NVDLA management are offloaded to a microcontroller. This limits the interrupt load on the main processor, and an optional dedicated high-bandwidth SRAM serves as a cache. Together, these allow the system to run multiple tasks.
The following diagram is the internal architecture for an NVDLA core.
Vision Accelerator (VA)
The system has two Vision Accelerator engines, each including a dual 7-way VLIW (Very Long Instruction Word) vector processor responsible for executing vision-related algorithms: feature detection and matching, optical flow, stereo disparity block matching, point cloud processing, and imaging filters such as convolutions, morphological operators, histogramming, colorspace conversion, and warping.
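To make "imaging filters such as convolutions" concrete, here is a minimal pure-Python sliding-window filter of the kind the VA engines accelerate in hardware. This is only a CPU illustration of the operation; the VA's VLIW implementation is of course very different:

```python
def convolve2d(image, kernel):
    """Naive 'valid' sliding-window filter (cross-correlation form),
    the basic operation behind many imaging filters."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    acc += image[i + u][j + v] * kernel[u][v]
            out[i][j] = acc
    return out

# A 3x3 box blur over a 4x4 ramp image produces a 2x2 output.
img = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
box = [[1 / 9] * 3 for _ in range(3)]
blurred = convolve2d(img, box)
```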
Jetson AGX Xavier Developer Kit
The Jetson AGX Xavier Developer Kit includes a non-production Jetson module and a reference carrier board that hosts the module and its I/O. Its purpose is to provide a self-contained computer system on which developers can start developing applications.
The I/O interfaces included are:
Here is the reference carrier board (left: top view and right: bottom view) that the Xavier module plugs into.
This is called an SoM (system-on-module) because, after plugging in a monitor, mouse, and keyboard, you have a full computer system. Once connected to a monitor, you can run Linux on it and even play video games.
Jetpack SDK & Jetson SDK
JetPack SDK and the Jetson SDKs provide the software packages needed for software development on Jetson devices. JetPack includes the Linux Driver Package (L4T) with the Linux OS, plus CUDA-X accelerated libraries and APIs for Deep Learning, computer vision, accelerated computing, and multimedia (TensorRT, cuDNN, CUDA Toolkit, VisionWorks, GStreamer, and OpenCV).
The DeepStream SDK is offered for video analytics, and the ISAAC SDK for autonomous machines, alongside the Jetson SDK.
Jetson Xavier NX
While the Jetson AGX Xavier module is the most powerful solution in the Jetson product line, Nvidia offers the Jetson Xavier NX for the mid-range market. Here is a nice demonstration of Jetson Xavier NX.
The following diagram is the hardware configuration for the developer kit.
And this is a simplified system diagram.
Like the Jetson AGX Xavier, Nvidia sells both a Jetson Xavier NX developer kit and a Jetson Xavier NX module. The developer kit consists of a reference carrier board and a Jetson Xavier NX module, but this module is not for production. Instead, a production-ready module, simply called the Jetson Xavier NX module, is sold separately.
What is the difference between the developer kit and the production module? The developer kit has some slightly different components that are not supported for production use. The Jetson Xavier NX developer kit uses a microSD card for secondary storage, while the production module has onboard eMMC memory. For production, you will need to design or procure a carrier board for your product and flash it with the software image you developed. The production module also does not include a heatsink, so you have to devise your own thermal solution (cooling system).
Nvidia offers another SoM, the Jetson Nano module, at a low price point of around $100.
Google Coral Edge TPU
AlphaGo Zero is trained by self-play reinforcement learning to play Go well enough to beat Go masters.
Can you imagine purchasing an SoM (system-on-module) for around $100 that runs a minimalist engine modeled after AlphaGo Zero? That engine is Minigo, an unofficial AlphaGo project, and it is exactly what the Google Coral Edge TPU can run.
The Coral TPU runs on Mendel Linux. We will run through a few specifications just as a reference. Here is the hardware spec for the dev board,
and the baseboard and SoM.
These are the connections provided on the dev board.
This is the datasheet for the SoM,
and the spec for the edge TPU and the memory:
Finally, these are the components in the SoC.
TensorFlow Lite is a lightweight version of TensorFlow (TF) designed for mobile and embedded devices, with a much smaller interpreter kernel. First, a model is trained with regular TF and saved as a .pb file. The 32-bit model can then be quantized to 8-bit fixed-point parameters after computing the dynamic range of the inputs and activations. This results in a smaller and faster model ready for inference. Here is the software code:
import tensorflow as tf

def representative_dataset_gen():
    for _ in range(num_calibration_steps):
        # Get sample input data as a numpy array in a
        # method of your choosing.
        yield [input_data]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_quant_model = converter.convert()
Then it is compiled to a .tflite file ready for TF Lite.
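The 8-bit quantization step described above can be sketched in plain Python: measure the dynamic range of the values, derive a scale and zero-point, and map each float to an 8-bit integer. This is a simplified asymmetric-quantization sketch, not TF Lite's exact algorithm:

```python
def quantize_int8(values):
    """Simplified asymmetric 8-bit quantization: map the observed
    dynamic range [min, max] onto the integer range [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate floats from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-0.52, -0.1, 0.0, 0.3, 0.48]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
# Round-trip error per value is bounded by roughly scale / 2.
```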
If some of the operations are not supported by the Edge TPU, they will be run on the CPU instead.
AI Edge Chip Startups
The competition for AI edge chips is pretty fierce, in particular among startups.
The Hailo-8 processor performs up to 26 TOPS according to Hailo.
Mythic IPU is tile-based where each tile includes a Matrix Multiply Accelerator (MMA), a RISC-V processor, a SIMD engine, SRAM, and a Network-on-Chip (NoC) router that connects tiles together in a 2D mesh. According to Mythic, the MMA uses analog design coupled with embedded flash memory with approximately 250 billion multiply-accumulate operations per second, at a very low energy cost (0.25 pJ/MAC).
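A quick cross-check of those Mythic numbers (my own arithmetic, using only the figures above): 0.25 pJ per MAC at roughly 250 billion MACs per second works out to only tens of milliwatts for the MAC array itself:

```python
# Back-of-the-envelope power estimate from the quoted Mythic figures.
PJ_PER_MAC = 0.25e-12     # 0.25 pJ per multiply-accumulate
MACS_PER_S = 250e9        # ~250 billion MACs per second

compute_power_w = PJ_PER_MAC * MACS_PER_S
print(f"{compute_power_w * 1e3:.1f} mW")   # ~62.5 mW for the MAC array alone
```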
The Kneron KL720 AI SoC comes in at 0.9 TOPS per watt for both computer vision and NLP.
And KL520 is targeted for smart home devices.
Horizon Robotics' Journey 2 (征程) is an AI processor for the automotive industry, and Sunrise (旭日) is mainly for smart cameras. Both chips are designed for inference on edge devices.
Perceive's Ergo can process large DNNs at 20 mW with 55 TOPS/W. It requires no external RAM and comes in a small 7x7 mm package.
The Syntiant NDP101 Neural Decision Processor is an ultra-low-power processor for always-on speech and audio recognition. It replaces buttons, switches, and dials, letting devices wake to speech rather than touch.