Inception Unveils Mercury: The First Commercial-Scale Diffusion Large Language Model

The landscape of generative AI and LLMs has taken a remarkable leap forward with the launch of Mercury by the startup Inception Labs. With the first-ever commercial-scale diffusion large language models (dLLMs), Inception Labs promises a paradigm shift in speed, cost-efficiency, and intelligence for text and code generation tasks.
Mercury: Setting New Benchmarks in AI Speed and Efficiency
Inception’s Mercury series of diffusion large language models operates at speeds previously unachievable with traditional LLM architectures: over 1000 tokens per second on commodity NVIDIA H100 GPUs, throughput formerly exclusive to custom hardware from vendors such as Groq, Cerebras, and SambaNova. This translates to a 5-10x speed increase over current leading autoregressive models.

Diffusion Models: The Future of Text Generation
Traditional autoregressive LLMs generate text sequentially, token-by-token, causing significant latency and computational costs, especially in extensive reasoning and error-correction tasks. Diffusion models, however, leverage a unique “coarse-to-fine” generation process. Unlike autoregressive models restricted by sequential generation, diffusion models iteratively refine outputs from noisy approximations, enabling parallel token updates. This method significantly enhances reasoning, error correction, and overall coherence of the generated content.
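Inception has not published Mercury’s exact decoding algorithm, but the general mechanics of parallel iterative refinement can be illustrated with a toy masked-denoising loop. Everything below (the vocabulary, the random “denoiser”, the unmasking schedule) is an illustrative stand-in, not Mercury’s implementation:

```python
import random

random.seed(0)

VOCAB = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "+"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a trained denoising network: propose a (token,
    confidence) pair for every masked position in one parallel pass.
    Proposals here are random; a real dLLM would predict them."""
    return {i: (random.choice(VOCAB), random.random())
            for i, tok in enumerate(tokens) if tok == MASK}

def diffusion_decode(length=10, steps=4):
    tokens = [MASK] * length          # start from a fully "noisy" sequence
    for step in range(steps):
        proposals = toy_denoiser(tokens)
        if not proposals:
            break
        # Commit the most confident slice of proposals in parallel, so the
        # output is refined coarse-to-fine over a handful of passes rather
        # than one token at a time.
        budget = max(1, len(proposals) // (steps - step))
        best = sorted(proposals.items(), key=lambda kv: -kv[1][1])[:budget]
        for pos, (tok, _) in best:
            tokens[pos] = tok
        print(f"step {step}: {' '.join(tokens)}")
    return tokens

diffusion_decode()
```

The key contrast with autoregressive decoding is in the inner loop: each pass updates many positions at once, and earlier commitments can in principle be revisited by a real denoiser, which is where the claimed gains in error correction come from.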
While diffusion approaches have proven revolutionary in image, audio, and video generation—powering applications like Midjourney and Sora—their application in discrete data domains such as text and code was largely unexplored until Inception’s breakthrough.
Mercury Coder: High-Speed, High-Quality Code Generation
Inception’s flagship product, Mercury Coder, is optimized specifically for coding applications. Developers now have access to a high-quality, rapid-response model capable of generating code at more than 1000 tokens per second, a dramatic improvement over existing speed-focused models.
On standard coding benchmarks, Mercury Coder doesn’t just match but often surpasses speed-oriented models such as GPT-4o Mini and Claude 3.5 Haiku. Moreover, Mercury Coder Mini secured a top-ranking position on Copilot Arena, tying for second place and outperforming established models like GPT-4o Mini and Gemini-1.5-Flash. Even more impressively, it achieves this while running approximately 4x faster than GPT-4o Mini.

Versatility and Integration
Mercury dLLMs function seamlessly as drop-in replacements for traditional autoregressive LLMs. They effortlessly support use-cases including Retrieval-Augmented Generation (RAG), tool integration, and agent-based workflows. The diffusion model’s parallel refinement allows multiple tokens to be updated simultaneously, ensuring swift and accurate generation suitable for enterprise environments, API integration, and on-premise deployments.
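Because dLLMs are positioned as drop-in replacements, integration would look like any standard LLM API call. The sketch below assumes an OpenAI-compatible chat-completions endpoint, a common pattern for drop-in providers; the base URL and model name are illustrative placeholders, not confirmed values from Inception:

```python
from openai import OpenAI

# Placeholder endpoint and model name -- substitute the actual values
# from Inception's documentation.
client = OpenAI(
    base_url="https://api.example-inception.com/v1",  # hypothetical URL
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="mercury-coder-small",  # hypothetical model identifier
    messages=[
        {"role": "user", "content": "Write a function that reverses a string."}
    ],
)
print(response.choices[0].message.content)
```

If the endpoint is indeed OpenAI-compatible, existing RAG pipelines and agent frameworks would only need the base URL and model name swapped to adopt the faster backend.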
Built by AI Innovators
Inception’s technology is underpinned by foundational research at Stanford, UCLA, and Cornell from its founders, recognized for crucial contributions to the evolution of generative AI. Their combined expertise includes the original development of image-based diffusion models and innovations such as Direct Preference Optimization, FlashAttention, and Decision Transformers—techniques widely acknowledged for their transformative impact on modern AI.
Inception’s introduction of Mercury marks a pivotal moment for enterprise AI, unlocking levels of speed, accuracy, and cost-efficiency that were previously out of reach.