SambaNova is very secretive about its work, yet it has attracted half a billion dollars in VC funding since its founding in 2017. In Part 2, we will put on our investigator hat and see what it may be working on. We dedicate Part 2 of this series entirely to a single company because its approach is quite novel. It has many of the design characteristics of AI chip startups and can lead to another major direction for AI chip development.
However, no outsider knows exactly what SambaNova is working on. But there are plenty of clues in the research done by Stanford professors and SambaNova cofounders Kunle Olukotun and Christopher Ré, in particular their vision for designing computer systems for Software 2.0. For the discussion in this article, let’s call this new vision System 2.0.
Let’s take a quick overview of the overall workflow first.
Many DL models are developed on standard deep learning (DL) platforms, like TensorFlow (TF). As discussed in Part 1, there is a large gap between DL models and modern physical hardware design. System 2.0 first compiles the TF model into an intermediate representation (IR). This representation captures the logical model, with memory structure and a high-level pipeline hierarchy injected to bridge this gap (details later). Then the IR is compiled into a hardware design and configurations optimized for this IR. To prove this concept, early research used the compiler to generate Chisel files (HDL, hardware description language, files) that were further processed to generate bitstreams for programming FPGAs. Does that sound simple and not revolutionary? Not exactly.
General AI Chip
Generalization and flexibility act against performance and power consumption. While the GPU runs the widest spectrum of ML (machine learning) algorithms with high performance, it consumes a lot of power, which hurts future scalability.
Google's TPU is designed as an ASIC (application-specific integrated circuit) and is highly specialized for a set of ML algorithms that matter a lot in Google's operations. Google designers think this vertical approach is the way to scale up AI performance in the future. Many researchers have pitched the end of Moore's Law as the justification for new architecture designs. But this has been pitched for the last 30 years already. Instead, performance-per-watt may be a better indicator of what we need to focus on. Indeed, SambaNova cofounder Olukotun argues that the industry may have already hit a power wall, in which power consumption has plateaued.
In contrast to the TPU, System 2.0 focuses on a new general AI chip addressing a wide spectrum of data-intense problems in ML, DL, and data analytics. The key challenge here is whether all these domains can be abstracted and solved using a unified approach. If it can be done, there is a chance to optimize the solution in hardware with the flexibility of an FPGA but with performance-per-watt near that of ASIC designs.
In addition, many AI chip developments adopt a bottom-up approach in solving hardware bottlenecks. Some may view it as patching a design targeted for the old world. In System 2.0, Stanford researchers take a top-down approach. They ask how an AI model should be modeled and trained before asking what the hardware may look like. Once the problem is better formulated, we can design the hardware accordingly for better acceleration. Because of this approach, we will spend the first half of the article on non-hardware topics. If you are interested in hardware only, feel free to skip the next section.
ML Algorithms (Optional)
First, we have to rethink the modern programming paradigm, which is designed for consistency and data integrity. Transactions and synchronization are introduced to make sure a bank transfer either succeeds completely or is rolled back completely. Precision and determinism are important. But DL is heuristic. The information noise and data sparsity are so high that it has a high tolerance for inaccuracies and mistakes. We don’t need to be exactly right or deterministic. Many golden rules of programming are just unnecessary luxuries that hurt performance. In ML, many rules can be relaxed. They include:
- synchronization and data race protection,
- cache coherence,
- precision and
- communication frequency on weight updates.
Asynchronous SGD weight updates
Hogwild! implements SGD without any locking. It allows concurrent threads to share a single global copy of the weights, with the possibility of overwriting each other’s work. In practice, weight sparsity translates to sparse weight updates. Even when many threads run concurrently, memory overwrites are rare. And even when one happens, the error introduced is tolerable. For future reference, this paper (with slides) describes the use of asynchronous SGD in Gibbs sampling to improve performance by 2.8x.
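To make the lock-free idea concrete, here is a minimal Python sketch, assuming a least-squares problem with sparse input rows. It is written in the spirit of Hogwild!, not taken from the paper: the function names, the learning rate, and the toy data generator are all assumptions made for illustration.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, epochs=20, lr=0.05, seed=0):
    """Lock-free SGD for least squares on sparse rows (a Hogwild!-style sketch)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)               # single shared copy, no lock protects it

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(epochs * n_samples // n_threads):
            i = rng.integers(n_samples)
            nz = np.nonzero(X[i])[0]        # only the non-zero features are touched
            err = X[i, nz] @ w[nz] - y[i]
            w[nz] -= lr * err * X[i, nz]    # racy in-place update, overwrites tolerated

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

def make_sparse_problem(n_samples=400, n_features=50, nnz=3, seed=1):
    """Hypothetical toy data: each row has only `nnz` non-zero features."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n_samples, n_features))
    for i in range(n_samples):
        cols = rng.choice(n_features, size=nnz, replace=False)
        X[i, cols] = rng.normal(size=nnz)
    w_true = rng.normal(size=n_features)
    return X, X @ w_true, w_true
```

Calling `hogwild_sgd(*make_sparse_problem()[:2])` trains with four threads that never synchronize. Because each update touches only three of the fifty weights, two threads rarely write the same entry at the same time.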
Recent DL trends lower the precision in both training and inference for a 2x-4x speed improvement. How can we improve accuracy further with lower precision? Currently, we use the same data format throughout the whole training. But as we approach the optimum, the loss gradients shrink. Therefore, we can use fewer bits to encode the data range and more bits for precision. In HALP, the weights are rescaled as training approaches the optimum, providing higher precision with the same number of bits (fewer bits are needed for the data range).
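The rescaling idea can be illustrated with a small fixed-point sketch. This is not the HALP algorithm, only the observation above: with the same number of bits, a narrower range gives a finer grid. The function names and the 8-bit signed format are assumptions.

```python
import numpy as np

def quantize(x, bits, scale):
    """Round x to a signed fixed-point grid with `bits` bits covering [-scale, scale]."""
    levels = 2 ** (bits - 1) - 1           # e.g. 127 for 8 bits
    step = scale / levels                  # grid spacing shrinks with the range
    q = np.clip(np.round(x / step), -levels, levels)
    return q * step

def max_quantization_error(x, bits, scale):
    """Largest absolute rounding error over all entries of x."""
    return float(np.max(np.abs(x - quantize(x, bits, scale))))

# Values close to the optimum are small, so a narrow range still covers them.
rng = np.random.default_rng(0)
x = rng.uniform(-0.01, 0.01, size=1000)
wide = max_quantization_error(x, bits=8, scale=1.0)      # range sized for early training
narrow = max_quantization_error(x, bits=8, scale=0.01)   # range rescaled near the optimum
```

With the same 8 bits, `narrow` is far smaller than `wide`, because the grid spacing drops from 1/127 to 0.01/127.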
Sparse weight communication
Weight sparsity has another implication. In each iteration, many gradient updates have small magnitudes. But these changes add up and cannot be ignored. So we communicate larger updates globally, while small changes are accumulated locally until they become large enough to be broadcast. In early training, however, we need more aggressive updates. Therefore, during the warm-up period, all changes are communicated.
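A minimal sketch of this accumulate-then-send scheme might look like the following. The class name, the fixed magnitude threshold, and the step-count warm-up are assumptions for illustration; published methods choose what to send in more refined ways.

```python
import numpy as np

class SparseGradientSender:
    """Keeps small gradient changes locally and broadcasts only the large ones."""

    def __init__(self, n_params, threshold, warmup_steps):
        self.residual = np.zeros(n_params)   # locally accumulated, not yet sent
        self.threshold = threshold
        self.warmup_steps = warmup_steps
        self.step = 0

    def prepare(self, grad):
        """Return (indices, values) to broadcast; everything else stays local."""
        self.step += 1
        self.residual += grad
        if self.step <= self.warmup_steps:
            mask = np.ones(self.residual.shape, dtype=bool)    # warm-up: send all
        else:
            mask = np.abs(self.residual) >= self.threshold     # later: large entries only
        idx = np.nonzero(mask)[0]
        values = self.residual[idx].copy()
        self.residual[idx] = 0.0             # sent entries restart from zero
        return idx, values
```

Nothing is dropped: at any time, the values sent so far plus the local residual add up to the sum of all gradients seen.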
To speed up training without losing accuracy, Deep Gradient Compression applies this sparse communication together with
- momentum correction,
- local gradient clipping,
- momentum factor masking, and
- warm-up training.
The first two techniques are commonly applied in SGD, so we will not elaborate on them further.
If stale weights are acceptable, we can easily tolerate cache incoherence. This paper suggests relaxing cache coherence by randomly ignoring cache invalidate requests. This strategy is called an obstinate cache and is implemented in hardware.
Sparsity is a distinct characteristic that we should exploit extensively in DL. The most active development is pruning the zero (or near-zero) weights, so we can compress the weights to reduce memory bandwidth and the amount of computation. As shown before, we can also reduce the frequency of weight updates. The third type of sparsity is in the data. For example, a one-hot vector is sparse because it contains mostly zeros. Graph neural networks (GNNs) are gaining attention for encoding and making inferences on such sparse data. For example, they can be used to encode a social graph or a molecular structure.
Domain-Specific Languages (DSL)
System 2.0 focuses on a generalized solution that works across data-intense problem domains. If we can find an abstract and generalized representation for all these domains, we are one step closer to a general AI chip. This effort is akin to finding the instruction set required for AI applications.
In fact, most data analytic computations, including ML and DL, can be expressed using the data-parallel patterns below. (These patterns are already heavily used in data analytics.)
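As a rough illustration, a few such patterns can be sketched in plain Python. The choice of map, zip, reduce, flatMap, and groupBy is an assumption based on common data-parallel operators, and the `p_` function names are made up for this sketch.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

def p_map(f, xs):
    """Apply f to every element independently."""
    return [f(x) for x in xs]

def p_zip(f, xs, ys):
    """Combine two collections element by element."""
    return [f(x, y) for x, y in zip(xs, ys)]

def p_reduce(f, init, xs):
    """Fold a collection into one value with an associative operator."""
    return reduce(f, xs, init)

def p_flat_map(f, xs):
    """Each element produces zero or more outputs, which are concatenated."""
    return list(chain.from_iterable(f(x) for x in xs))

def p_group_by(key, xs):
    """Bucket elements by a key function."""
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)
```

Every element in a map, zip, or flatMap can be processed independently, and a reduce with an associative operator can be evaluated as a tree. That is the parallelism the hardware can exploit.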
To show their potential, the code below implements an ML algorithm, k-means clustering, with these patterns.
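For readers who prefer something runnable, here is one way k-means could be phrased with map, groupBy, and reduce in Python. It is a sketch under the assumption that points are tuples of numbers; the helper names are hypothetical, and this is not the DSL code from the research.

```python
from collections import defaultdict
from functools import reduce

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_step(points, centroids):
    # map: each point independently finds the index of its nearest centroid
    labels = list(map(
        lambda p: min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k])),
        points))
    # groupBy: collect the points assigned to each centroid
    groups = defaultdict(list)
    for label, p in zip(labels, points):
        groups[label].append(p)
    # reduce: sum the points of each group, then divide to get the new mean
    new_centroids = []
    for k, c in enumerate(centroids):
        members = groups.get(k)
        if not members:
            new_centroids.append(c)          # an empty cluster keeps its centroid
            continue
        total = reduce(lambda acc, p: tuple(a + b for a, b in zip(acc, p)), members)
        new_centroids.append(tuple(t / len(members) for t in total))
    return new_centroids

def kmeans(points, centroids, iterations=10):
    """Repeat the assignment and update steps a fixed number of times."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

For example, `kmeans_step([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` moves the two centroids to `(0.0, 1.0)` and `(10.0, 11.0)`.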
As another example, the convolution layer of CNN can be expressed with the map and reduce below.
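A small Python sketch of the same idea, assuming a single-channel image, stride 1, and no padding (all simplifications chosen here): the outer maps range over output pixels, and the inner map and reduce multiply and sum over the kernel window.

```python
from functools import reduce

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation, as in CNN layers) on nested lists."""
    out_h = len(image) - len(kernel) + 1
    out_w = len(image[0]) - len(kernel[0]) + 1
    offsets = [(di, dj) for di in range(len(kernel)) for dj in range(len(kernel[0]))]

    def output_pixel(i, j):
        # inner map: one product per kernel position
        products = map(lambda d: image[i + d[0]][j + d[1]] * kernel[d[0]][d[1]], offsets)
        # inner reduce: sum the products into one output value
        return reduce(lambda a, b: a + b, products, 0)

    # outer maps: every output pixel is independent of the others
    return list(map(lambda i: list(map(lambda j: output_pixel(i, j), range(out_w))),
                    range(out_h)))
```

For a 3x3 image `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]` and the kernel `[[1, 0], [0, 1]]`, this returns `[[6, 8], [12, 14]]`.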
For more complex data-intense applications and DL models, we can concatenate and nest these parallel patterns into a hierarchical structure to represent them.
Our modern programming model was developed for general computing. Concurrency and parallelism are usually discovered at the fine-grain local level only. It is often extremely hard to rediscover the global structure. In short, you can’t see the forest for the trees. But by encoding the algorithms with the parallel patterns above, these high-level structures are maintained throughout the process and can be easily optimized later. Let’s get an overview in the next two sections before detailing it with an example.
Metapipelining is the pipelining of pipelines. ML algorithms can be realized as a hierarchical dataflow containing pipelines and metapipelines.
By coding the problem as a hierarchical pipeline, we carry a master concurrency plan that can be accelerated in hardware. In addition, we can use explicit controls to design a more complex dataflow as we wish.
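To see why a hierarchical pipeline helps, here is a back-of-the-envelope timing model in Python. It assumes fixed stage latencies and enough buffering between stages for them to overlap on different tiles; the latencies are made-up numbers, not measurements.

```python
def sequential_cycles(stage_latencies, n_tiles):
    """Every tile passes through all stages before the next tile starts."""
    return n_tiles * sum(stage_latencies)

def pipelined_cycles(stage_latencies, n_tiles):
    """Stages overlap on different tiles; the slowest stage sets the pace."""
    return sum(stage_latencies) + (n_tiles - 1) * max(stage_latencies)

# Inner pipeline: 4 one-cycle stages processing the 16 elements of one tile.
inner = pipelined_cycles([1, 1, 1, 1], n_tiles=16)         # 4 + 15 = 19 cycles

# Outer (meta) pipeline: load a tile, run the inner pipeline, store the result.
outer_stages = [16, inner, 4]                               # hypothetical latencies
seq_total = sequential_cycles(outer_stages, n_tiles=8)      # 8 * 39 = 312 cycles
pipe_total = pipelined_cycles(outer_stages, n_tiles=8)      # 39 + 7 * 19 = 172 cycles
```

The inner pipeline's total time becomes the latency of one stage in the outer pipeline, which is what "a pipeline of pipelines" means in this model.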
Tiling and Memory Hierarchy
But there are still major gaps between this logical view and the hardware design. In particular, the logical view treats the input as one single data space with unlimited memory. In the physical world, we tile the computation, i.e. we subdivide a workload such that each chunk fits into the limited memory and computation resources.
In the example below, instead of working on an unboundedly large data space, we use the outer “reduce” loop to break up the data in the off-chip memory into tiles (tA and tB) of size B.
In many AI algorithms, data movement is a major performance bottleneck and energy hog. Reducing data movement is critical. By maximizing data locality, we can minimize the data movement between off-chip and on-chip memory. Therefore, to move one step closer to hardware design, System 2.0 introduces the memory hierarchy and data movement into the model. For example, the code below provides a parameterized model (sized with parameters C and D) that declares data storage in external memory (DRAM), on-chip SRAM tiles, and registers.
Below is another example of moving data onto local distributed SRAM (a more detailed example comes later).
By defining such data movement, the compiler can size the resources properly to avoid global memory accesses or complex caching schemes. In summary, tiling and metapipelining capture locality and nested parallelism. To accelerate AI algorithms, we can take advantage of reconfigurable hardware devices to exploit the nested parallelism in metapipelining and the data locality in the memory hierarchies and tiles.
Let’s get into the details to fill some of the holes in our discussion.
Spatial is a language for encoding the IR (intermediate representation) for the hardware accelerator (called the accelerator IR). It also comes with a compiler that translates the IR into hardware design configurations.
Starting with models or code in a DSL (domain-specific language), like TensorFlow, they can be transformed into IRs containing parallel patterns. Then they are further transformed into the accelerator IR encoded in Spatial. Finally, these IRs are compiled into hardware configurations describing the hardware designs, say for programming FPGAs.
Say we want to create a hardware solution to accelerate the dot product of two vectors. The logical model for this operation can be represented by the map and reduce pattern below.
The code on the right below is how the dot product will be represented (encoded) in the accelerator IR using Spatial.
As shown, it adds an outer reduce loop to break down the workload of N elements into chunks of B elements, i.e. it tiles the computation. In the outer loop, it loads two tiles (tA and tB) from the DRAM (global shared memory) into SRAM (local memory). Then the inner loop performs the multiplication and the accumulation. It makes B an explicit parameter whose value will be optimized later by the Spatial compiler according to the capacity of the target hardware and the type of computation using the tile.
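The tiling described here can be mimicked in plain Python. This is only an analogy for the structure of the loops, not Spatial code: the array slices stand in for DRAM-to-SRAM loads, and the names `tiled_dot`, `tA`, and `tB` follow the text.

```python
import numpy as np

def tiled_dot(a_dram, b_dram, B):
    """Dot product with an outer loop over tiles of size B and an inner multiply-accumulate."""
    N = len(a_dram)
    acc = 0.0
    for start in range(0, N, B):                    # outer reduce: one iteration per tile
        tA = np.array(a_dram[start:start + B])      # stands in for a DRAM -> SRAM load
        tB = np.array(b_dram[start:start + B])
        partial = 0.0
        for i in range(len(tA)):                    # inner loop: multiply and accumulate
            partial += tA[i] * tB[i]
        acc += partial                              # combine the per-tile partial sums
    return acc
```

The result does not depend on B, up to floating-point rounding. That is why the tile size can be left as a free parameter and chosen later to fit the on-chip memory.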
As shown below, the left diagram is the accelerator IR for the dot product of two vectors. The Spatial compiler takes the IR and compiles it into a hardware design for the target chips. As a demonstration, the right diagram below is the logical block diagram for the compiled hardware. Data is loaded into a scratchpad memory (a distributed local memory). Then the hardware iterates over the elements to compute the dot product. In this example, the inner loop includes a 4-way multiplication for parallelization.
In the Spatial IR, many configurations, like sizing information, are implicitly or explicitly parameterized. This defers the decision of their values and allows the compiler to optimize them based on the target device capacity and the DNN. In this example, these parameters include:
- the tile size B that we already discussed,
- how much pipelining to use (the number of stages), or none at all (sequential execution), in the metapipeline and the inner loop.
Here, we have a 3-stage pipeline below.
In the design below, a double buffer is deployed: one buffer is fetching data while the other is feeding data to the pipeline. This allows the stages to run without stalling.
- what level of parallelism (4-way) to exploit in each component.
- how much memory banking (how many units of data can be read in parallel from the memory) and the banking mode (details later) for the tiles to maintain a constant supply of data.
Here is the IR with the corresponding parameters (middle section) to be optimized by the compiler.
Again, the diagram below repeats this information using the logical blocks for the hardware design.
For completeness, here is the list of parameters and controls that the compiler can optimize.
Another Spatial Example (Optional)
To demonstrate the power of the Spatial IR, here is a more complex example: matrix multiplication.
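The same tiling idea carries over to matrices. The Python sketch below is an analogy rather than Spatial code; the tile sizes `tile_m`, `tile_n`, and `tile_k` are hypothetical parameters that a compiler would pick for the target hardware.

```python
import numpy as np

def tiled_matmul(A, B, tile_m, tile_n, tile_k):
    """Blocked matrix multiply: each output tile accumulates products of input tiles."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    for i in range(0, M, tile_m):
        for j in range(0, N, tile_n):
            # accumulator for one output tile, small enough to stay on chip
            tC = np.zeros((min(tile_m, M - i), min(tile_n, N - j)))
            for k in range(0, K, tile_k):
                tA = A[i:i + tile_m, k:k + tile_k]   # tile of A "loaded" from DRAM
                tB = B[k:k + tile_k, j:j + tile_n]   # tile of B "loaded" from DRAM
                tC += tA @ tB
            C[i:i + tile_m, j:j + tile_n] = tC       # write the finished tile back
    return C
```

Each output tile is written to off-chip memory only once, after all of its partial products have been accumulated locally.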
As shown below, Spatial provides many control primitives for creating a complex hierarchical dataflow.
Plasticine (A Reconfigurable Architecture For Parallel Patterns)
Now, we have an accelerator IR written in Spatial. It encodes a coarse-grain dataflow using hierarchical parallel patterns. It also contains tiling, memory hierarchy, and data movement information. We can use the Spatial compiler to exploit this information to generate a hardware design.
Finally, we are ready to discuss Plasticine, a reconfigurable general processor for data-intense applications including ML and DL. It is designed to optimize parallel patterns, with the goal of achieving near-ASIC performance with high energy efficiency and a small die size, but with the programmability and generality of CPUs and GPUs.
Plasticine is composed of alternating Pattern Compute Units (PCUs) and Pattern Memory Units (PMUs) connected by switch boxes.
Pattern Compute Unit (PCU)
The Spatial compiler will map the inner loop computation to a PCU, i.e. a PCU is responsible for executing a single, innermost parallel pattern. Each PCU contains a multi-stage SIMD pipeline that combines pipeline parallelism and SIMD for data parallelism.
The following diagram shows a PCU with 4 pipeline stages and 4 SIMD lanes (i.e. it processes a single instruction on 4 data elements concurrently).
Using a producer-consumer pattern, the PCU buffers the vector and scalar inputs with separate, small FIFOs. Each SIMD lane is composed of functional units (FUs) interleaved with pipeline registers (PRs). These registers store intermediate scalar results and propagate values between stages within the same lane. Therefore, no communication outside the PCU or scratchpad accesses are needed for intermediate calculations. The outputs of these lanes can be merged to form a vector output.
These FUs, containing arithmetic and binary operations, are reconfigurable to implement the function f needed in the parallel pattern.
Crosslane communication between FUs can be done by
- a reduction tree network to reduce values from multiple lanes into a single scalar, or
- a lane shifting network to present PRs with sliding windows across stages.
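Both cross-lane mechanisms can be modeled in a few lines of Python. This is a behavioral sketch only: the function names are made up, and the lane count is assumed to be a power of two.

```python
def reduction_tree(lanes, op):
    """Combine lane values pairwise, halving the count at every level of the tree."""
    values = list(lanes)
    n = len(values)
    assert n > 0 and (n & (n - 1)) == 0, "lane count must be a power of two"
    levels = 0
    while len(values) > 1:
        values = [op(values[i], values[i + 1]) for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels

def shift_lanes(lanes, incoming):
    """Shift every value one lane over and feed `incoming` into lane 0."""
    return [incoming] + list(lanes[:-1])

# Streaming values through the shift network exposes a sliding window across lanes.
window = [0, 0, 0, 0]
history = []
for x in [1, 2, 3, 4, 5]:
    window = shift_lanes(window, x)
    history.append(list(window))
```

With 4 lanes, `reduction_tree([1, 2, 3, 4], lambda a, b: a + b)` returns `(10, 2)`: the sum, reached in two tree levels instead of three sequential additions.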
The PCU also contains a control block with reconfigurable combinational logic. This block controls a programmable chain of counters that serves as a state machine to trigger the start of a workflow and supplies the loop indices for parallel patterns. (PCU execution begins when the control block enables one of the counters.)
To scale the system, Plasticine follows a hierarchical control structure to control pipelines, i.e. the parent pipeline coordinates the execution of its children. Based on such flow dependencies and the application’s explicit controls (if any), we can configure the control block to combine control signals from parent and sibling PCUs, as well as the local FIFO status, to trigger the execution of a PCU.
Let’s introduce some memory access patterns before we discuss the Pattern Memory Unit (PMU). The memory access pattern in parallel patterns can be simplified as:
- statically predictable linear functions of the pattern indices,
- or random accesses.
The former case usually creates a pattern that lets us load the data in parallel easily. For the latter case, we often consolidate multiple requests together. On top of that, data may or may not be reused by the PCU. Data that is used only once is called streaming data, and data that is reused is called tiled data.
Pattern Memory Unit (PMU)
Each PMU contains a programmer-managed scratchpad memory (a distributed local memory) for tiled data.
The scratchpad is coupled with a reconfigurable scalar unit for address calculation (address decoding). To support multiple SIMD lanes in the PCU, the scratchpad contains multiple SRAM banks, matching the number of PCU lanes, for parallel streaming.
To improve the memory bandwidth for different data access patterns, the address decoding logic can be configured to operate in several banking modes to support various access patterns. These modes are:
- Strided banking mode allows linear access patterns for dense data structures.
- FIFO mode supports data streaming.
- Line buffer mode follows a sliding window pattern.
- Duplication mode duplicates contents across all memory banks for parallelized random access (gather operation).
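To give a feel for why the banking mode matters, here is a small Python model of two of these modes. The address mapping (bank = address mod number of banks) is a common interleaving scheme assumed here for illustration, not a documented detail of Plasticine.

```python
def strided_bank(addr, n_banks):
    """Interleave consecutive addresses across banks: returns (bank, offset in bank)."""
    return addr % n_banks, addr // n_banks

def has_bank_conflict(addrs, n_banks):
    """True if two parallel lanes would hit the same bank in the same cycle."""
    banks = [strided_bank(a, n_banks)[0] for a in addrs]
    return len(set(banks)) < len(banks)

def duplicated_gather(data, addrs, n_banks):
    """Duplication mode: every bank holds a full copy, so lane i reads from bank i."""
    assert len(addrs) <= n_banks, "one bank per lane"
    banks = [list(data) for _ in range(n_banks)]    # contents copied to all banks
    return [banks[lane][a] for lane, a in enumerate(addrs)]
```

With 4 banks, the linear accesses `[0, 1, 2, 3]` hit four different banks and can be served in one cycle, while `[0, 4, 8, 12]` all land in bank 0. Random addresses can collide in the same way, which is what duplication avoids at the cost of storing the data once per bank.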
In addition, to support coarse-grained pipeline parallelism, the scratchpad can be configured to operate as an N-buffer with any of the banking modes.
The PMU also has a control block that works similarly to the one in the PCU. Its logic is reconfigurable and is responsible for triggering the PMU read and/or write operations.
The PCUs and PMUs communicate with other units through a pipelined interconnect of switches. These interconnects are statically configured for custom datapaths. The interconnect contains three separate networks, all with the same topology, but each is responsible for a different data type: one for scalars, one for vectors, and one for control. Links in switches include registers to prevent long wire delays.
These switches are similar to the interconnects in an FPGA. But in an FPGA, routing is configured at a fine-grain (bit) level. Such fine-grained control takes up a lot of die space and power. Plasticine connects complex components, like PCUs, at a coarse-grain level. By avoiding fine-grain configuration, Plasticine saves a lot of die space and is more power efficient. In addition, the place and route algorithms take minutes instead of the hours or days needed for an FPGA.
Off-chip Memory Access
Off-chip DRAM is accessed by 4 DDR memory channels through address generators (AG) on two sides of the chip. Each AG contains a reconfigurable scalar unit for address calculation similar to PMU. And each AG contains FIFOs to buffer outgoing commands, data, and incoming responses.
Multiple AGs are connected to an address coalescing unit. The memory commands from AGs can be dense or sparse. Dense requests are converted to multiple DRAM burst requests by the coalescing unit to bulk transfer contiguous DRAM areas. This is commonly used in reading or writing tiles of data. On the other hand, sparse requests are enqueued into a stream of addresses in the coalescing unit. The random reads and writes will be performed with gathers and scatters. This minimizes the number of issued DRAM requests.
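The two request types can be sketched as follows. The 64-byte burst size and the function names are assumptions for illustration; the real coalescing unit is hardware, not Python.

```python
def dense_to_bursts(start, length, burst_bytes=64):
    """Split a contiguous byte range into burst-aligned DRAM requests."""
    first = (start // burst_bytes) * burst_bytes    # align down to a burst boundary
    end = start + length
    return list(range(first, end, burst_bytes))

def coalesce_sparse(addresses, burst_bytes=64):
    """Merge random addresses that fall in the same burst into a single request."""
    requests = {}
    for a in addresses:
        base = (a // burst_bytes) * burst_bytes
        requests.setdefault(base, []).append(a)     # one DRAM request per base address
    return requests
```

For example, `coalesce_sparse([3, 70, 10, 65, 500])` issues three burst requests (at 0, 64, and 448) instead of five separate accesses.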
Plasticine uses a distributed and hierarchical control scheme to minimize synchronization between units. Control is handled in the outer loop of a nested pipeline, which controls the execution sequence of the pipelines that it contains. It supports three types of controller schemes for its children:
- sequential execution: execution is done sequentially without pipeline optimization,
- coarse-grained pipelining: pipeline optimization is enabled, and
- streaming: units are concatenated with FIFOs to form a large pipeline.
The diagram below summarizes the type of hardware acceleration that we discussed in supporting the specific programming model.
Mapping Spatial IR to Hardware
Let’s see how the Spatial IR maps onto the Plasticine architecture. In the right diagram below,
- the light blue area is responsible for loading data from DRAM. That is done on the AG shown in light blue in the left diagram.
- Tiles tA and tB correspond to PMUs, both in deep blue.
- The red area executes the inner loop of the parallel pattern and maps to a PCU.
- A separate PCU in yellow is responsible for the reduction operation.
The following is how an NN can map onto this architecture.
To unify the solutions across data-intense applications, System 2.0 uses parallel patterns to model these problem domains. Regardless of the software platforms that the problem is coded on (TensorFlow or Spark), they can be converted and represented with parallel patterns. This lays down a path in creating a new general AI chip.
First, the DSL application models or codes are converted into parallel patterns — the Parallel Pattern IR.
Using a high-level compiler, information like tiling is added into the Spatial IR to bridge the gap between the logical model and the hardware design. Throughout the process, the IR retains the hierarchical pipeline structures. With Plasticine designed to optimize parallel pattern execution, the Spatial IR can map onto Plasticine easily at a simple, coarse-grain level. In fact, the Spatial compiler only takes minutes to run, in contrast to the hours needed for an FPGA’s place and route. This reflects how well Plasticine is designed for executing parallel patterns with minimal design gaps.
Researchers make things perfect and engineers make things cheap.
We don’t have a crystal ball to tell whether System 2.0 is the next computing architecture for AI or whether existing solutions like GPUs still have a lot of mileage left. This all depends on the market share and how well engineers can simplify the concept, prioritize goals, and execute them correctly. Many unforeseen problems will pop up and need to be solved. SambaNova is likely working on the compiler as well as a chip similar to Plasticine. The cofounders also address the need for hardware acceleration in handling dynamic precision (HALP), sparsity, and access to training data. In addition, they want to address the labeling complexity with weak supervision. But we don’t have enough detail on what this complete workflow may look like in the context of SambaNova. Or this may be just another startup concept like Snorkel AI. Whether System 2.0 or Plasticine will be ahead of its time or everyone will jump on the bandwagon, only time will tell, likely in the next couple of years.
We have now discussed half of the AI chip companies that we want to cover. In the final article, we will discuss Intel and mobile SoCs from vendors like Apple, Qualcomm, and Samsung. We will also look into other chips specialized for AI acceleration as well as developments in Asian countries. Finally, we will look into vendors focused on low-power AI edge devices.