smaller, faster, cheaper

Analysis Whether or not OpenAI’s new open weights models are any good is still up for debate, but their use of a relatively new data type called MXFP4 is arguably more important, especially if it catches on among OpenAI’s rivals.

The format promises massive compute savings compared to data types traditionally used by LLMs, allowing cloud providers or enterprises to run them using just a quarter of the hardware.

What the heck is MXFP4?

If you’ve never heard of MXFP4, that’s because, while it’s been in development for a while now, OpenAI’s gpt-oss models are among the first mainstream LLMs to take advantage of it.

This is going to get really nerdy, really quickly here, so we won’t judge if you want to jump straight to the why it matters section.

MXFP4 is a 4-bit floating point data type defined by the Open Compute Project (OCP), the hyperscaler cabal originally kicked off by Facebook in 2011 to try and make datacenter components cheaper and more readily available. Specifically, MXFP4 is a micro-scaling block floating-point format, hence the name MXFP4 rather than just FP4.

This micro-scaling function is kind of important, as FP4 doesn't offer a whole lot of resolution on its own. With just four bits, one for the sign, two for the exponent, and one for the mantissa, it has only 16 bit patterns to play with, covering the values 0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, and ±6. That's compared to BF16, which can represent 65,536 values.

If you took these four BF16 values, 0.0625, 0.375, 0.078125, and 0.25, and converted them directly to FP4, their values would now be 0, 0.5, 0, and 0.5 due to what becomes rather aggressive rounding.
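
To make the rounding concrete, here's a minimal Python sketch that snaps each value to the nearest entry in the FP4 (E2M1) grid. The value set comes from the OCP definition; breaking ties away from zero is an assumption made to match the example above.

```python
# Round the four BF16 values above straight to FP4 (E2M1), with no scaling.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID += [-v for v in FP4_GRID if v != 0.0]

def to_fp4(x: float) -> float:
    """Snap x to the nearest FP4 value, breaking ties away from zero."""
    return min(FP4_GRID, key=lambda v: (abs(x - v), -abs(v)))

print([to_fp4(x) for x in (0.0625, 0.375, 0.078125, 0.25)])
# -> [0.0, 0.5, 0.0, 0.5]
```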

Through some clever mathematics, MXFP4 is able to represent a much broader range of values. This is where the scaling bit of MX data types comes into play.

Here's a basic overview of how MX data types work

MXFP4 quantization works by taking a block of higher-precision values (32 by default) and dividing them by a common scaling factor, a power of two stored as an 8-bit exponent, before rounding each to the nearest FP4 value. Using this approach, our four BF16 values become 1, 6, 1.5, and 4. As you've probably already noticed, that's a big improvement over standard FP4.

This is sort of like how FP8 works, but rather than applying the scaling factor to the entire tensor, MXFP4 applies this to smaller blocks within the tensor, allowing for much greater granularity between values.

During inference, these figures are then de-quantized on the fly by multiplying each 4-bit value by the block's scaling factor, which gives us 0.0625, 0.375, 0.09375, and 0.25. We still run into a rounding error on that third value, but it's a lot more precise than 0, 0.5, 0, 0.5.
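
For the curious, here's a toy Python version of that round trip, shrunk to a four-value block instead of the default 32. The power-of-two scale rule (pick the scale so the block's largest magnitude lands near FP4's top value of 6) follows the OCP MX approach, while the tie-breaking is an assumption chosen to reproduce the numbers above.

```python
import math

FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_GRID += [-v for v in FP4_GRID if v != 0.0]

def round_fp4(x: float) -> float:
    """Nearest FP4 value, ties broken away from zero (an assumption)."""
    return min(FP4_GRID, key=lambda v: (abs(x - v), -abs(v)))

def mx_quantize(block):
    """Quantize one block to a shared power-of-two scale plus FP4 elements."""
    amax = max(abs(x) for x in block)
    # 2 is FP4's largest exponent (6 = 1.5 * 2**2), so this keeps the
    # biggest element within range after scaling.
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    return scale, [round_fp4(x / scale) for x in block]

def mx_dequantize(scale, elements):
    """De-quantize by multiplying each 4-bit element by the block's scale."""
    return [e * scale for e in elements]

scale, q = mx_quantize([0.0625, 0.375, 0.078125, 0.25])
print(scale, q)                 # 0.0625 [1.0, 6.0, 1.5, 4.0]
print(mx_dequantize(scale, q))  # [0.0625, 0.375, 0.09375, 0.25]
```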

MXFP4, we should note, is only one of several micro-scaling data types. There are also MXFP6 and even MXFP8 versions, which function similarly in principle.

Why MXFP4 matters

MXFP4 matters because the smaller the weights are, the less VRAM, memory bandwidth, and potentially compute are required to run the models. In other words, MXFP4 makes genAI a whole lot cheaper.

How much cheaper? Well, that depends on your point of reference. Compared to a model trained at BF16 — the most common data type used for LLMs these days — MXFP4 would cut compute and memory requirements by roughly 75 percent.
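
The arithmetic behind that figure is straightforward, assuming the default 32-value blocks each sharing one 8-bit scale, and counting weights only:

```python
bf16_bits  = 16
mxfp4_bits = 4 + 8 / 32   # 4-bit element plus its share of the block's scale

print(mxfp4_bits)                  # 4.25 bits per weight
print(1 - mxfp4_bits / bf16_bits)  # ~0.73, or roughly 75 percent smaller
```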

We say roughly because realistically you won't be quantizing every model weight. According to the gpt-oss model card [PDF], OpenAI said it applied MXFP4 quantization to about 90 percent of the model's weights. This is how it was able to cram the 120 billion parameter model into a GPU with just 80GB of VRAM, or the smaller 20 billion parameter version onto one with as little as 16GB of memory.

Quantized to MXFP4, gpt-oss doesn't just occupy roughly a quarter of the memory of an equivalently sized model trained at BF16; it can also generate tokens up to 4x faster, since spitting out each token is largely a matter of how quickly the weights can be streamed from memory.

Some of that will depend on the compute. As a general rule, every time you halve the floating point precision, you can double the chip’s floating point throughput. A single B200 SXM module offers about 2.2 petaFLOPS of dense BF16 compute. Drop down to FP4, which Nvidia’s Blackwell silicon offers hardware acceleration for, and that jumps to nine petaFLOPS.
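
Those figures line up with the halving rule, give or take Nvidia's rounding:

```python
bf16_pflops = 2.2                 # B200 SXM, dense BF16, the figure quoted above
fp8_pflops  = bf16_pflops * 2     # ~4.4 petaFLOPS at FP8
fp4_pflops  = fp8_pflops * 2      # ~8.8 petaFLOPS, close to the quoted nine at FP4
print(fp8_pflops, fp4_pflops)
```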

While the extra FLOPS may boost token throughput a little, when it comes to inference they mostly mean less time waiting for the model to start generating its answer, since chewing through the prompt is the compute-heavy part of the job.

To be clear, your hardware doesn't need native FP4 support to work with MXFP4 models. Nvidia's H100s, which were used to train gpt-oss, don't support FP4 natively, yet can run the models just fine. They just don't enjoy all of the data type's benefits.

OpenAI is setting the tone

Quantization isn’t a new concept. Model devs have been releasing FP8 and even 4-bit quantized versions of their models for a while now. 

However, these quants are often perceived as a compromise, as lower precision inherently comes with a loss in quality. How significant that loss is depends on the specific quantization method, of which there are many.

That said, research has repeatedly shown the loss in quality going from 16 bits to eight is essentially nil, at least for LLMs. There's still enough information at that precision for the model to work as intended. In fact, some model builders, like DeepSeek, have started training models natively in FP8 for this reason.

While vastly better than standard FP4, MXFP4 isn't necessarily a silver bullet. Nvidia argues the data type can still suffer some degradation compared to FP8, in part because its 32-value block sizes aren't granular enough. To address this, the GPU giant has introduced its own micro-scaling data type called NVFP4, which aims to improve quality by using 16-value blocks and an FP8 scaling factor.
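
In storage terms, that trade-off is small. Here's a rough, weights-only comparison; any higher-level per-tensor scaling is ignored, an assumption made to keep the numbers simple:

```python
mxfp4_bits = 4 + 8 / 32   # one 8-bit scale shared across 32 values
nvfp4_bits = 4 + 8 / 16   # one FP8 scale shared across 16 values

print(mxfp4_bits, nvfp4_bits)   # 4.25 vs 4.5 bits per weight
print(nvfp4_bits / mxfp4_bits)  # ~1.06: finer blocks cost about 6 percent more bits
```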

Ultimately, however, it's up to the enterprise, API provider, or cloud provider to decide whether to deploy the quant or stick with the original BF16 release.

With gpt-oss, OpenAI has made that choice for them. There is no BF16 or FP8 version of the models; MXFP4 is all we get. Given its outsized position in the market, OpenAI is basically saying, if MXFP4 is good enough for us, it should be good enough for you.

And that’s no doubt welcome news for the infrastructure providers tasked with serving these models. Cloud providers in particular don’t get much say in what their customers do with the resources they’ve leased. The more model builders that embrace MXFP4, the more likely folks are to use it.

Until then, OpenAI gets to talk up how much easier its open models are to run than everyone else’s and how they can take advantage of newer chips from Nvidia and AMD that support the FP4 data type natively. ®
