Groq has deconstructed the conventional CPU and designed a chip in which software takes control of the hardware.
The Groq Tensor Streaming Processor Architecture follows a growing trend toward software control of system functions, one already seen in autonomous cars, networks and other hardware.
The architecture hands control of the chip's hardware to the compiler. Software control units are strategically placed on the chip to optimize data movement and processing.
The units are organized in a way that is consistent with the typical data flow found in machine learning models.
“Determinism enables this software-defined hardware approach. We are not concerned with abstracting the details. For us, it’s about controlling the hardware underneath,” says Dennis Abts, chief architect at Groq.
Abts shared the design of the Groq Tensor Streaming Processor Architecture at this week’s Hot Chips conference. Hardware and software co-design isn’t new, but the concept saw a resurgence at the conference, with Intel CEO Pat Gelsinger in a speech pointing to the concept at the heart of the future of chips.
Groq is one of many companies designing chips specifically for AI. Such chips compute results based on discovered patterns, probabilities and associations, and that same workload structure is the basis for the architecture's software control of the hardware.
“What we’ve done is try to prevent some of this waste, fraud and abuse that is emerging at the system level,” Abts said.
System-level complexity grows in heterogeneous computing environments that combine tens to thousands of processing units, such as CPUs, GPUs, and SmartNICs, each with varying performance, power, and failure profiles.
“This gives you a lot of performance variation in, for example, response time and latency. And that latency variation ultimately slows down an Internet-scale application,” Abts said.
Groq reexamined the hardware-software interface on the chip to enable deterministic processing. That meant making unconventional design choices and rethinking the chip from scratch.
“This enables … an ISA that enables our software stack. We explicitly transfer control to the software, especially the compiler, so that it can reason from a principled standpoint about correctness and plan instructions on the hardware,” Abts said.
At the top, the chip has a static-dynamic interface, which gives the compiler a complete picture of the system at any time. That replaces the runtime interfaces found on conventional CPUs.
The static-dynamic interface allows the hardware to be completely controlled by the compiler, without abstracting away the details of the hardware. The compiler has a “wonderful picture of what the hardware does in a given cycle,” Abts said.
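To make the idea concrete, here is a toy sketch in Python of what a statically scheduled program looks like: every instruction carries an exact issue cycle assigned at compile time, so the state of the machine at any cycle is known before the program runs. This is an illustration only, not Groq's actual ISA or toolchain; the unit names and instruction format are invented.

```python
# Toy sketch (not Groq's real instruction format): a statically scheduled
# program in which the compiler has assigned every instruction an exact
# issue cycle and functional unit ahead of time.

# Each entry: (issue_cycle, functional_unit, operation)
schedule = [
    (0, "MEM", "load  w0 <- sram[0]"),
    (0, "VXM", "vadd  v1 <- v2, v3"),    # different units issue in the same cycle
    (1, "MXM", "matmul a0 <- w0, v1"),
    (3, "MEM", "store sram[64] <- a0"),  # compiler accounts for matmul latency
]

def units_busy_at(cycle):
    """The compiler's 'complete picture': which units issue at a given cycle."""
    return [unit for c, unit, _ in schedule if c == cycle]

print(units_busy_at(0))  # ['MEM', 'VXM']
```

Because the schedule is fixed, there is nothing to discover at run time: no out-of-order logic, no speculation, just instructions firing at their assigned cycles.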
Transferring hardware control to software frees the hardware to perform other functions. The architecture differs from traditional designs, which rely on out-of-order execution, speculative execution and other techniques to extract parallelism and memory concurrency, Abts said.
The chip has 220 MB of “scratchpad” memory and dedicated tensor streams, so that the compiler can control the calculations, where they go on the chip, and how data moves in each cycle. The design makes memory concurrency available throughout the system.
Groq has also broken apart functional elements normally found in a conventional CPU, such as integer and vector units, and regrouped them separately, much like pooling memory or storage into a single box, with proximity offering performance benefits. This is especially beneficial for AI applications.
The chip design is different from conventional CPUs, but the execution model is analogous: “In the same way that conventional CPUs break down larger instructions into micro-operations, we break down deep learning operations into their constituent smaller micro-operations, and we perform those as an ensemble that together achieves a greater goal,” Abts said.
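The decomposition Abts describes can be sketched as tiling: one large matrix multiplication becomes many tile-sized micro-operations that together produce the full result. This is a hypothetical illustration of the idea; the tile size and micro-op naming are assumptions, not Groq's actual decomposition.

```python
# Hypothetical sketch: breaking one deep-learning operation (a matmul)
# into tile-sized micro-operations that execute "as an ensemble".
# Tile size and op names are illustrative, not Groq's real scheme.

def matmul_micro_ops(m, n, k, tile=2):
    """Emit one micro-op per (row-tile, col-tile, depth-tile) of C = A @ B."""
    ops = []
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                ops.append(f"mxm_tile C[{i}:{i+tile},{j}:{j+tile}] "
                           f"+= A[{i}:{i+tile},{p}:{p+tile}] @ "
                           f"B[{p}:{p+tile},{j}:{j+tile}]")
    return ops

ops = matmul_micro_ops(4, 4, 4)
print(len(ops))  # 8 micro-ops for a 4x4 @ 4x4 matmul with 2x2 tiles
```

Each micro-op is small enough to map onto one functional unit for a known number of cycles, which is what lets a compiler schedule the whole ensemble statically.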
The chip design includes matrix multiplication units, which Abts called the “workhorse” of the chip. The unit contains storage for 409,600 “weights,” providing the parallelism needed to make AI applications faster.
The chip’s building blocks also include SRAM memory, programmable vector units, 480 GB/s network units, and data switches. These all connect to 144 on-chip instruction control units, which dispatch tasks to their associated functional units.
“This allows us to keep the hardware overhead of instruction dispatch very low. Less than 3 percent of the area is used for decoding and dispatching instructions,” Abts said.
Groq has also taken a software-defined approach to reduce network congestion.
“The compiler can literally plan the network links, just as it would plan an ALU (arithmetic logic unit) or matrix unit. This replaces some of the more conventional [hardware-based] approaches,” said Abts, referring specifically to adaptive routing.
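Treating a network link like an ALU means the compiler reserves each link for specific cycles at compile time, so no adaptive routing decisions are needed at run time. The sketch below illustrates that reservation idea; the link names, transfer labels, and conflict rule are assumptions for illustration, not Groq's compiler internals.

```python
# Hedged sketch: network links as compiler-schedulable resources, like ALUs.
# Each (link, cycle) slot is reserved statically; a double-booking would be
# a compile-time error, so no adaptive routing is needed at run time.

reservations = {}  # (link, cycle) -> transfer description

def reserve(link, cycle, transfer):
    """Statically reserve a link for one cycle; conflicts fail at 'compile' time."""
    if (link, cycle) in reservations:
        raise ValueError(f"conflict on {link} at cycle {cycle}")
    reservations[(link, cycle)] = transfer

reserve("east0", 5, "tensor_a chip0->chip1")
reserve("east0", 6, "tensor_b chip0->chip1")  # same link, next cycle: fine
try:
    reserve("east0", 5, "tensor_c chip0->chip1")  # same slot: rejected
except ValueError as e:
    print(e)  # conflict on east0 at cycle 5
```

Because every transfer has a pre-planned slot, latency through the network is as predictable as latency through an arithmetic unit, which is the repeatable end-to-end performance Abts describes.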
“What we’re trying to achieve is predictable and repeatable performance that provides low latency and high throughput across the entire system,” said Abts.