AI Chips Technology Trends & Landscape (SambaNova)

SambaNova


General AI Chip


ML Algorithms (Optional)

Parallelizing training across many processing units raises issues including:

  • synchronization and data-race protection,
  • cache coherence,
  • precision, and
  • the communication frequency of weight updates.
Gradient-compression schemes such as Deep Gradient Compression combine:

  • momentum correction,
  • local gradient clipping,
  • momentum factor masking, and
  • warm-up training

to speed up training without losing accuracy.
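As a rough illustration (a simplified Python sketch, not SambaNova's or DGC's actual implementation; the clipping threshold and the sparsity level `k` are arbitrary choices here), each worker accumulates unsent gradient values locally, clips them, and transmits only the largest-magnitude entries:

```python
def dgc_step(grads, residual, k, clip=1.0):
    """One sketch of a gradient-compression step on a single worker."""
    # accumulate this step's gradients onto the locally kept residual,
    # so small gradients are not lost, merely delayed
    acc = [g + r for g, r in zip(grads, residual)]

    # local gradient clipping: rescale if the local norm exceeds `clip`
    norm = sum(a * a for a in acc) ** 0.5
    if norm > clip:
        acc = [a * clip / norm for a in acc]

    # transmit only the k largest-magnitude entries; everything else
    # stays behind as residual for a later step
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    sent = {i: acc[i] for i in top}
    residual = [0.0 if i in sent else a for i, a in enumerate(acc)]
    return sent, residual
```

The residual accumulation is what makes aggressive sparsification tolerable: a value skipped this step still contributes once it grows large enough.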

Domain-Specific Languages (DSL)


Design Gap

Compute a dot product for two vectors
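Spatial itself is a Scala-embedded DSL, so the following is only a structural sketch in plain Python: an outer loop that steps through the vectors one tile at a time, and an inner map-reduce over each tile (the tile size B is the knob discussed below).

```python
def dot(a, b, B=4):
    """Tiled dot product: outer loop over tiles, inner map-reduce."""
    total = 0.0
    # outer loop: one iteration per tile of size B
    for t in range(0, len(a), B):
        tA, tB = a[t:t + B], b[t:t + B]   # tile load (DRAM -> on-chip memory)
        # inner loop: elementwise multiply, then reduce within the tile
        total += sum(x * y for x, y in zip(tA, tB))
    return total
```

On hardware, the tile loads and the inner compute can overlap, which is exactly what the metapipelining knob below controls.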

Spatial

The Spatial program exposes several design parameters:

  • the tile size B that we already discussed,
  • how many pipeline stages to use in the metapipeline and the inner loop, or whether to run with no pipelining at all (sequentially),
  • what level of parallelism (e.g. 4-way) to exploit in each component, and
  • how much memory banking (how many units of data can be read from memory in parallel) and which banking mode (details later) to use for the tiles, so as to maintain a constant supply of data.
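To make the parallelism knob concrete, here is an illustrative Python sketch (again, not Spatial code): the inner loop is unrolled `par`-way, mirroring SIMD lanes that are each fed by their own memory bank, with the lane partials combined at the end as a reduction tree would.

```python
def dot_par(a, b, B=8, par=4):
    """Tiled dot product with a par-way unrolled inner loop."""
    total = 0.0
    for t in range(0, len(a), B):          # outer loop over tiles
        lanes = [0.0] * par                # one accumulator per SIMD lane
        for i in range(B // par):
            for lane in range(par):        # these par multiplies happen
                j = t + i * par + lane     # in the same cycle on hardware
                if j < len(a):
                    lanes[lane] += a[j] * b[j]
        total += sum(lanes)                # reduction combines lane partials
    return total
```

The `par` lanes only stay busy if the memory can supply `par` operands per cycle, which is why the parallelism and banking knobs must be chosen together.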

Another Spatial Example (Optional)

Matrix multiplication (𝐶 = 𝐴 · 𝐵)
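The same tiling idea extends to matrix multiplication. A structural Python sketch (illustrative only; Spatial would additionally pipeline and parallelize these loops) with tile size T:

```python
def matmul_tiled(A, B, T=2):
    """C = A @ B computed one T x T tile pair at a time."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, m, T):
            for k0 in range(0, k, T):
                # multiply one tile of A with one tile of B and
                # accumulate into the corresponding tile of C
                for i in range(i0, min(i0 + T, n)):
                    for j in range(j0, min(j0 + T, m)):
                        for kk in range(k0, min(k0 + T, k)):
                            C[i][j] += A[i][kk] * B[kk][j]
    return C
```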

Plasticine (A Reconfigurable Architecture For Parallel Patterns)

Tiled architecture with reconfigurable SIMD pipelines, distributed scratchpads, and statically programmed switches
Pattern Compute Unit (PCU) architecture shown with 4 stages and 4 SIMD lanes
Cross-lane communication within a PCU is handled by:

  • a reduction tree network to reduce values from multiple lanes into a single scalar, or
  • a lane-shifting network to present PRs with sliding windows across stages.

On-chip memory accesses fall into two categories:

  • statically predictable linear functions of the pattern indices, or
  • random accesses.
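The reduction tree combines lane values pairwise, taking log₂(lanes) stages instead of a sequential chain of adds. A minimal Python sketch of that behavior:

```python
def reduce_tree(lanes):
    """Pairwise (tree) reduction: log2(n) stages rather than n - 1 adds."""
    vals = list(lanes)
    while len(vals) > 1:
        # each stage adds adjacent pairs in parallel
        pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # odd element passes through unchanged
            pairs.append(vals[-1])
        vals = pairs
    return vals[0]
```

For 4 SIMD lanes, the scalar result is ready after 2 tree stages.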
Pattern Memory Unit (PMU) architecture: configurable scratchpad, address calculation datapath, and control
The scratchpad supports several banking modes:

  • Strided banking mode allows linear access patterns for dense data structures.
  • FIFO mode supports data streaming.
  • Line-buffer mode follows a sliding-window pattern.
  • Duplication mode duplicates contents across all memory banks for parallelized random access (gather operations).
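Two of these modes can be sketched in a few lines of Python (a conceptual model, not the PMU's actual addressing logic). In strided mode, element i lives in bank i mod N, so N consecutive elements can be read in one cycle; in duplication mode, every bank holds a full copy, so N arbitrary addresses can be served at once.

```python
def strided_banks(data, nbanks):
    """Strided mode: element i goes to bank i % nbanks, word i // nbanks."""
    banks = [[] for _ in range(nbanks)]
    for i, v in enumerate(data):
        banks[i % nbanks].append(v)
    return banks

def read_parallel(banks, word):
    """One element per bank per cycle: len(banks) values fetched at once."""
    return [bank[word] for bank in banks]

def duplicated_banks(data, nbanks):
    """Duplication mode: full copy per bank, enabling parallel random reads."""
    return [list(data) for _ in range(nbanks)]
```

The trade-off is capacity: duplication spends N times the storage to buy conflict-free random access.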
Gather: data is read without the requirement that addresses be contiguous.
Demonstration of the double-buffer concept in a pipeline
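The double-buffer idea is simple to model: while the consumer computes on one buffer, the producer loads the next tile into the other, and the two swap each step. A hedged Python sketch (sequential code standing in for two concurrently running hardware units):

```python
def double_buffered_pipeline(tiles, load, compute):
    """Ping-pong buffering: load of tile t+1 overlaps compute on tile t."""
    buf = [None, None]
    out = []
    cur = 0
    buf[cur] = load(tiles[0])              # fill the first buffer
    for t in range(1, len(tiles) + 1):
        if t < len(tiles):
            buf[1 - cur] = load(tiles[t])  # producer prefetches next tile
        out.append(compute(buf[cur]))      # consumer works on current tile
        cur = 1 - cur                      # swap roles for the next step
    return out
```

On hardware, the `load` and `compute` of each step execute in the same cycle window, hiding the memory latency behind the compute.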
Source: Kunle Olukotun
A general example of gathers and scatters
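In plain terms, a gather reads from scattered addresses into a dense vector, and a scatter writes a dense vector out to scattered addresses. A minimal Python sketch:

```python
def gather(mem, indices):
    """Gather: read from non-contiguous addresses into a dense vector."""
    return [mem[i] for i in indices]

def scatter(mem, indices, values):
    """Scatter: write a dense vector out to non-contiguous addresses."""
    for i, v in zip(indices, values):
        mem[i] = v
    return mem
```

This is the access pattern that duplication-mode banking accelerates: with a copy per bank, several of these random reads can be served in the same cycle.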
Each unit's controller can be scheduled in one of three modes:

  • sequential — execution proceeds one stage at a time, with no pipeline overlap,
  • coarse-grained pipelining — pipeline optimization is enabled, and
  • streaming — units are concatenated with FIFOs to form one large pipeline.
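A back-of-the-envelope latency model shows why pipelining matters (an idealized model with unit-time stages and no stalls; real hardware adds fill, drain, and memory effects): sequential execution costs stages × iterations, while an overlapped pipeline costs stages + iterations − 1.

```python
def total_cycles(stages, iters, schedule):
    """Idealized cycle count for a loop body of `stages` unit-time stages."""
    if schedule == "sequential":
        # each iteration runs all stages before the next one starts
        return stages * iters
    if schedule == "pipeline":
        # stages overlap: after the pipeline fills, one result per cycle
        return stages + iters - 1
    raise ValueError(f"unknown schedule: {schedule}")
```

For 3 stages and 10 iterations, that is 30 cycles sequentially versus 12 pipelined; streaming extends the same overlap across multiple units connected by FIFOs.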

Optimization

Programming model components and their corresponding hardware implementation requirements

Mapping Spatial IR to Hardware

For the dot-product example, the mapping works as follows:

  • The light blue area is responsible for loading data from DRAM; this maps to the AGs shown in light blue on the left diagram.
  • Tiles tA and tB correspond to PMUs, both in deep blue.
  • The red area executes the inner loop of the parallel pattern and maps to a PCU.
  • A separate PCU, in yellow, handles the reduction operation.

Recap


Thoughts

Next

Credit & References
