Devs are finally getting serious about efficiency

Feature If you’ve been following AI development over the past few years, one trend has remained constant: bigger models are usually smarter, but also harder to run.

This is particularly problematic in parts of the world where access to America’s most sophisticated AI chips is restricted – like, say, China.

But even outside of China, model builders are increasingly turning to mixture of experts (MoE) architectures along with emerging compression tech to drive down the compute requirements of serving large language models (LLMs). Nearly three years since ChatGPT kicked off the generative AI boom, it seems folks are finally starting to think about the cost of running these things.

To be clear, we’ve seen MoE models, like Mistral AI’s Mixtral, before, but it’s only in the last year or so that the technology has really taken off.

Over the past few months, we’ve seen a wave of new open-weight LLMs from the likes of Microsoft, Google, IBM, Meta, DeepSeek, and Alibaba based on some kind of MoE architecture.

And the reason is simple: The architecture is a helluva lot more efficient than traditional “dense” model architectures.

Vaulting the memory wall

First described in the early ’90s in a paper [PDF] titled “Adaptive Mixtures of Local Experts,” the basic idea is that instead of one great big model trained on a bit of everything, work is routed to one or more of any number of smaller sub-models, or “experts.”

In theory, each of these experts can be optimized for a domain-specific task, like coding, mathematics, or writing. Unfortunately, few model builders go into much detail about the various experts that make up their MoE models, and the exact number varies from model to model. The important bit is only a small portion of the model is in use at any given moment.

For example, DeepSeek’s V3 model is composed of 256 routed experts along with one shared expert. But only eight routed experts, plus the shared one, are activated per token.
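
If you’re curious what that routing step looks like, here’s a minimal PyTorch sketch of a generic softmax top-k gate using DeepSeek-like numbers. It’s purely illustrative (DeepSeek’s production router adds its own scoring and load-balancing tricks), but it shows why only a sliver of the weights is touched for any given token:

import torch
import torch.nn.functional as F

def route_tokens(hidden, router_weights, top_k=8):
    """Pick the top-k routed experts for each token.

    hidden:         (num_tokens, d_model) token activations
    router_weights: (d_model, num_experts) learned router projection
    Returns expert indices and normalized gate weights per token.
    """
    # One router logit per (token, expert) pair
    logits = hidden @ router_weights                 # (num_tokens, num_experts)
    probs = F.softmax(logits, dim=-1)
    # Keep only the k highest-scoring experts for each token
    gate_vals, expert_ids = probs.topk(top_k, dim=-1)
    # Renormalize so the selected gates sum to 1 per token
    gate_vals = gate_vals / gate_vals.sum(dim=-1, keepdim=True)
    return expert_ids, gate_vals

# Toy shapes: 4 tokens, 1,024-dim activations, 256 routed experts
hidden = torch.randn(4, 1024)
router = torch.randn(1024, 256)
ids, gates = route_tokens(hidden, router)   # ids.shape == (4, 8)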

Because of this, MoE models don’t always match the quality of similarly sized dense models. Take Alibaba’s Qwen3-30B-A3B MoE model for example. It consistently fell behind the dense Qwen3-32B model in Alibaba’s own benchmark testing.

The loss in quality – at least if the benchmarks are to be believed – is pretty minor compared to the leap in efficiency gained from the MoE architecture. Fewer active parameters also mean the amount of memory bandwidth required to achieve a given level of performance is no longer proportional to the capacity needed to store the model weights.

In other words, MoE models may still need a ton of memory, but it doesn’t all have to be ultra-fast or ultra-expensive HBM anymore.

To illustrate this, let’s compare the system requirements for Meta’s largest “dense” model, Llama 3.1 405B, to Llama 4 Maverick, which is nearly as big, but uses an MoE architecture with 17 billion active parameters.

Factors like batch size, floating point performance, and the key-value cache all play into real-world performance, but we can at least get a rough sense of the minimum bandwidth requirements of a model by multiplying its size in gigabytes at a given precision (1 byte per parameter for 8-bit models) by the target tokens per second at a batch size of one.

To run an 8-bit quantized version of Llama 3.1 405B — more on quantization in a bit — you’d need more than 405 GB of vRAM and at least 20 TB/s of memory bandwidth in order to generate text at 50 tokens per second.

For reference, Nvidia’s HGX H100-based systems, which we’ll remind you were selling for $300,000 or more until recently, only had 640 GB of HBM3 and about 26.8 TB/s of aggregate bandwidth. If you wanted to run the full 16-bit model, you would have needed at least two of them.

By comparison, Llama 4 Maverick still consumes the same amount of memory, but needs less than 1 TB/s of bandwidth to achieve the same performance. That’s because only 17 billion parameters worth of model experts are actually used to generate the output.

That means, on the same hardware, Llama 4 Maverick should generate text an order of magnitude faster than Llama 3.1 405B.
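
Here’s that napkin math as a few lines of Python, using the same assumptions as above: 8-bit weights, a batch size of one, and every active parameter read from memory once per generated token.

def min_bandwidth_tbps(params_billion, bytes_per_param, tokens_per_sec):
    """Rough lower bound on memory bandwidth (TB/s) at batch size one:
    every active weight has to be read once per generated token."""
    gb_read_per_token = params_billion * bytes_per_param   # billions of params x bytes ~= GB
    return gb_read_per_token * tokens_per_sec / 1000       # GB/s -> TB/s

# Llama 3.1 405B at 8-bit: all 405 billion weights are read for every token
print(min_bandwidth_tbps(405, 1, 50))   # ~20.25 TB/s
# Llama 4 Maverick at 8-bit: only the ~17 billion active weights are read
print(min_bandwidth_tbps(17, 1, 50))    # ~0.85 TB/s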

On the other hand, if performance isn’t as big a concern, you can now get away with running many of these models on cheaper, albeit slower, GDDR6, GDDR7, or even DDR in the case of Intel’s latest Xeons.

Nvidia’s new RTX Pro Servers, announced at Computex this week, are primed to do just that. Rather than high-bandwidth memory (HBM), which is expensive, power-hungry, and requires advanced packaging to integrate, each of the eight RTX Pro 6000 GPUs found in the systems features 96 GB of GDDR7 memory — the same kind you’d find in a modern gaming card.

Combined, these systems offer up to 768 GB of vRAM and 12.8 TB/s of aggregate bandwidth — more than enough to run Llama 4 Maverick at several hundred tokens per second.

Nvidia hasn’t shared pricing, but with the workstation edition of these cards currently retailing for around $8,500, we wouldn’t be surprised to find them selling for less than half of what an HGX H100 used to go for.

With that said, MoE doesn’t spell an end for HBM-stacked GPUs. We don’t expect we’ll see Llama 4 Behemoth — assuming it ever ships — running on anything short of a rack full of GPUs.

While the thing has roughly half the active parameters of Llama 3.1 405B, it’s got 2 trillion of them in total. There’s not a single conventional GPU server on the market today that can fit the full 16-bit model and what’ll inevitably be a million-plus token context window.

Are CPUs finally having their AI moment?

Depending on your use case, you may not need a GPU at all — something that might come in handy in regions where imports of high-end accelerators are restricted.

Back in April, Intel demoed a dual-socket Xeon 6 platform equipped with a full complement of 8800 MT/s MCRDIMMs, achieving a throughput of 240 tokens per second in Llama 4 Maverick at an average output latency of less than 100 ms per token.

Put more succinctly, the Xeon platform was able to maintain 10 tokens per second or better per user for roughly 24 concurrent users.

Intel didn’t share batch 1 (single user) performance — and we can’t blame them, as it’s not all that relevant a metric in the real world — but a little back-of-the-napkin math says the most it could have been was right around 100 tokens per second.
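
For the curious, here’s that napkin math in Python. The memory bandwidth figure is our own estimate, assuming two sockets of 12-channel memory running at 8,800 MT/s; Intel didn’t publish a number for the demo system.

# Concurrency figures straight from Intel's demo
aggregate_tps = 240    # tokens/sec across all users
per_user_tps = 10      # 100 ms/token average output latency => ~10 tok/s per user
print(aggregate_tps / per_user_tps)   # ~24 concurrent users

# Batch-1 ceiling: bound by how fast the ~17 GB of active 8-bit weights can be
# streamed from memory. Bandwidth below is our estimate, not Intel's figure:
# 2 sockets x 12 channels x 8,800 MT/s x 8 bytes ~= 1.7 TB/s
est_bandwidth_gbps = 1700
active_weights_gb = 17
print(est_bandwidth_gbps / active_weights_gb)   # ~100 tok/s upper bound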

With that said, unless you have no better option or very specific requirements, the economics of CPU-based inference remain heavily dependent on your workload.

Cutting weights: pruning and quantization

MoE architectures can certainly reduce the memory bandwidth required to serve larger models, but they don’t do anything to reduce the amount of memory required to hold their weights. As we mentioned earlier, even at 8-bit precision, Llama 4 Maverick still needs in excess of 400 GB of memory to run, regardless of how many parameters are active.

However, emerging pruning techniques and quantization could, with a little extra work, cut that in half without compromising on quality.

Nvidia has been betting on pruning for some time now. The GPU giant has released several pruned versions — models that have had redundant or less valuable weights discarded — of Meta’s Llama 3 models.

It was also among the first to extend support for 8-bit floating point datatypes in 2022, and again with 4-bit floating point with the launch of its Blackwell architecture in 2024. Meanwhile, AMD’s first chips to offer native FP4 support are expected to make their debut next month.

While not strictly necessary, native hardware support for these datatypes generally reduces the likelihood of running into compute bottlenecks, especially when serving at scale.

At the same time, we’ve seen a number of model builders embrace lower-precision datatypes, including Meta, Microsoft, Alibaba, and others offering eight-bit and even four-bit quantized versions of their models.

We’ve previously explored quantization in depth, but in a nutshell, it involves compressing model weights from their native precision, usually BF16, down to FP8 or INT4. This effectively halves or even quarters the memory bandwidth and capacity requirements of the models, at the expense of some quality loss.

In general, the losses going from 16 bits to eight usually aren’t noticeable, and some model builders, including DeepSeek, have started training at FP8 precision from the get-go. But carve off another four bits and the loss in quality can be quite pronounced. Because of this, many post-training approaches to quantization, like GGUF, don’t compress all of the weights equally, leaving some at higher precisions to limit the losses.
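
To make that a bit more concrete, here’s a minimal sketch of post-training, round-to-nearest quantization in PyTorch. Real-world schemes such as GGUF’s k-quants use block-wise scales and mix precisions rather than a single per-row scale, but the basic squeeze-and-round mechanic is the same:

import torch

def quantize_symmetric(weights, bits=4):
    """Symmetric round-to-nearest quantization of a weight tensor.

    Maps BF16/FP16 values onto a signed integer grid, e.g. [-7, 7] for 4 bits,
    keeping one floating-point scale per output row for dequantization."""
    qmax = 2 ** (bits - 1) - 1                                 # 7 for INT4, 127 for INT8
    scale = weights.abs().amax(dim=-1, keepdim=True) / qmax    # per-row scale
    q = torch.clamp(torch.round(weights / scale), -qmax, qmax)
    return q.to(torch.int8), scale                             # int8 container, 4-bit range

def dequantize(q, scale):
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096, dtype=torch.bfloat16).float()
q, s = quantize_symmetric(w, bits=4)
print((dequantize(q, s) - w).abs().mean())   # the rounding error you pay for the shrinkage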

Last month, Google demonstrated the use of quantization-aware training (QAT) to shrink its Gemma 3 models by a factor of four while achieving quality close to native BF16.

QAT works by simulating low-precision operations during the training process. By applying the tech for around 5,000 steps on an unquantized model, Google says it was able to reduce the drop in perplexity — a metric for measuring quantization-related losses — by 54 percent when converted to INT4.
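
The core trick is so-called fake quantization: weights are rounded to the low-precision grid in the forward pass, while gradients flow through as if nothing happened (the straight-through estimator). A bare-bones PyTorch sketch, not Google’s actual recipe, looks something like this:

import torch

class FakeQuant(torch.autograd.Function):
    """Simulate INT4 rounding in the forward pass while letting gradients
    pass through unchanged (straight-through estimator)."""
    @staticmethod
    def forward(ctx, w, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().amax() / qmax
        # Quantize then immediately dequantize, so downstream math sees the rounded values
        return torch.clamp(torch.round(w / scale), -qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # pretend rounding is the identity for gradients

# During QAT fine-tuning, the layer "sees" its quantized weights in the forward pass
w = torch.randn(1024, 1024, requires_grad=True)
loss = FakeQuant.apply(w).sum()
loss.backward()   # gradients still reach the full-precision master weights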

Another QAT-based approach called BitNet aims to go even lower, compressing models to just 1.58 bits per parameter, or roughly a tenth of their size.
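
For the curious, the weight-quantization step roughly follows the “absmean” scheme described in the BitNet b1.58 paper: every weight is snapped to -1, 0, or +1 times a single scaling factor. The sketch below is illustrative only, since the real method applies this during training rather than after the fact.

import torch

def ternarize(w):
    """Map weights to {-1, 0, +1} times one per-tensor scale (absmean scheme)."""
    scale = w.abs().mean()                                   # absmean scaling factor
    return torch.clamp(torch.round(w / scale), -1, 1), scale

w = torch.randn(4096, 4096)
t, s = ternarize(w)
print(t.unique())   # tensor([-1., 0., 1.])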

Tying it all together

Combine MoE and 4-bit quants and you’re really cooking, especially if you’re bandwidth-constrained by Blackwell Ultra sticker shock, or because Uncle Sam’s trade policies have made HBM more valuable than gold.

For everyone else, either one of the two technologies can significantly reduce the equipment and operating cost of running larger, more capable models – assuming you can find something valuable for them to do.

And if you can’t, you can at least take solace in the fact you’re not alone. A recent IBM survey of 2,000 CEOs found that just a quarter of AI deployments had delivered the return on investment they’d promised. ®
