Cerebras gives waferscale chips an inferencing twist • The Register
Hot Chips Inference performance in many modern generative AI workloads is usually a function of memory bandwidth rather than compute. The faster you can shuttle bits in and out of a high-bandwidth memory (HBM) the faster the model can generate a response.
Cerebra Systems’ first inference offering, based on its previously announced WSE-3 accelerator, breaks with this contention. That’s because instead of HBM, the dinner-plate-sized slab of silicon is so big that the startup says it has managed to pack 44GB of SRAM capable of 21 PBps of bandwidth. To put that in perspective a single Nvidia H200’s HBM3e boasts just 4.8TBps of bandwidth.
According to CEO Andrew Feldman, by using SRAM the part is capable of generating upwards of 1,800 tokens per second when running Llama 3.1 8B at 16-bit precision, compared to upwards of 242 tokens per second on the top performing H100 instance.
Running Llama 3.1 8B, Cerebras says its CS-3 systems can churn out a 1,800 tokens per second – Click to enlarge
When running the 70 billion parameter version of Llama 3.1 distributed across four of its CS-3 accelerators, Cerebras claims to have achieved 450 tokens per second. By comparison, Cerebras says the best the H100 can manage is 128 tokens per second.
Cerebras says its chips can drive a 70 billion parameter model at 450 tokens per second per user. – Click to enlarge
Feldman argues that this level of performance, much like the rise of Broadband, will open up new opportunities for AI adoption. “Today, I think we’re in the dial up era of Gen AI,” he said, pointing to early applications of generative AI where prompts are greeted with a noticeable delay.
If you can process requests quickly enough, he argues that building agentic applications based around multiple models can be done without latency becoming untenable. Another application where Feldman sees this kind of performance being beneficial is by allowing LLMs to iterate on their answers over multiple steps rather than just spitting out their first response. If you can process the tokens quickly enough you can mask the fact this is happening behind the scenes.
But while 1,800 tokens per second might seem fast, and it is, a little back of the napkin math tells us that Cerebra’s WSE-3 should be able spit out tokens way faster if it weren’t for the fact the system is compute constrained.
The offering represents a bit of a shift for Cerebras which until now has largely focused on AI training. However the hardware itself hasn’t actually changed. Feldman tells The Register that it’s using the same WSE-3 chips and CS-3 systems for inference and training. And, no, these aren’t binned parts that didn’t make the cut for training duty — we asked.
“What we’ve done is we’ve extended the capability of the compiler to place multiple layers on a chip at the same time,” Feldman said.
SRAM is fast but makes HBM look positively capacious
While SRAM has obvious advantages over HBM in terms of performance, where it falls short is capacity. When it comes to large language models (LLMs), 44GB just isn’t much when you also have to take into consideration that key value caching takes up a not inconsiderable amount of space at the high batch sizes that Cerebras is targeting.
Meta’s Llama 3 8B model is an idealized scenario of the WSE-3, as at 16GB (FP16) of size, the entire model can fit within the chip’s SRAM, leaving about 28GB of space left over for the key-value cache.
Feldman claims that in addition to extremely high throughput, WSE-3 also can scale to higher batch sizes, though exactly how far it can scale and maintain per user token generation rates the startup hesitated to say.
“Our current batch size is changing frequently. We expect in Q4 to be running batch sizes well into the double digits,” Cerebras told us.
Pressed for more specifics, it added, “Our current batch size # is not mature so we’d prefer not to provide it. The system architecture is designed to operate at high batch sizes and we expect to get there in the next few weeks.”
Much like modern GPU, Cerebras is getting around this challenge by parallelizing models across multiple CS-3 systems. Specifically, Cerebras is using pipeline parallelism to distribute the model’s layers across multiple systems.
For Llama 3 70B, which requires 140GB of memory, the model’s 80 layers are distributed across four CS-3 systems interconnected via ethernet. As you might expect, this does come at a performance penalty as data has to cross those links.
Because the CS-3 only has 44GB of SRAM on board, multiple accelerators need to be sitched together to support larger models – Click to enlarge
However the latency hit, according to Feldman, the node-to-node latency isn’t as big as you might think. “The latency here is real, but small, and it’s amortized over the tokens run through all the other layers on the chip,” he explained. “At the end, the wafer to wafer latency on the token that constitutes about 5 percent of the total.”
For larger models like the recently announced 405 billion parameter variant of Llama 3, Cerebras reckons that it’ll be able to achieve about 350 tokens per second using 12 CS-3 systems.
A knock on Groq
If ditching HBM for SRAM sounds familiar, that’s because Cerebras isn’t the first to go this route. As you might have noticed, Cerebra’s next closest competitor — at least according to their performance claims — is Groq.
Groq’s Language Processing Unit (LPU) actually uses a similar approach to Cerebras in that it relies on SRAM. The difference is that because Groq’s architecture is less SRAM dense, you need a lot more accelerators connected via fiber optics to support any given model.
Where Cerebras needs four CS-3 systems to run Llama 3 70B at 450 tokens per second, Groq has previously said it needed 576 LPUs to break 300 tokens per second. The Artificial Analysis Groq benchmarks cited by Cerebras came in slightly lower at 250 tokens per second.
Feldman is also keen to point out that Cerebras is able to do this without resorting to quantization. Cerebras contends that Groq is using 8-bit quantization to hit their performance targets, which reduces the model size, compute overhead, and memory pressure at the expense of some loss in accuracy. You can learn more about the pros and cons of Quantization in our hands-on here.
Availability
Similar to Groq, Cerebras plans to provide inference services via an OpenAI-compatible API. The advantage of this approach is that developers which have already built apps around GPT-4, Claude, Mistral, or other cloud based models, don’t have to refactor their code to incorporate Cerebra’s inference offering.
In terms of cost, Cerebras is also looking to undercut the competition offering Llama3-70B at a rate of 60 cents per million tokens. And, if you’re wondering, that’s assuming a 3:1 ratio of input to output tokens.
By comparison, Cerebras clocks the cost of serving the same model on H100s on competing clouds at $2.90 / million tokens. Though, as usual with AI inferencing there are a lot of knobs and levers to turn that directly impact the cost and performance of serving a model, so take Cerebra’s claims with a grain of salt.
However, unlike Groq, Feldman says Cerebras will continue to offer on-prem systems for certain customers, like those operating in highly regulated industries.
While Cerebras may have a performance advantage over competing accelerators, the offering is still somewhat limited in terms of supported models. At launch, Cerebras supports both the eight and 70 billion parameter versions of Llama 3.1. However, the startup plans to add support for 405B, Mistral Large 2, Command R+, Whisper, Perplexity Sonar, as well as custom fine-tuned models. ®