- A New Paradigm May Be Forming
- Meet Inception Labs and Mercury
- How Mercury Works
- Inside the Diffusion Revolution
- Training and Scale
- Performance: 10× Faster, Same Quality
- A Historical Echo
- What Comes Next
- Further Reading
A New Paradigm May Be Forming
In a recent exchange on X, Elon Musk echoed a striking prediction: diffusion models — the same generative approach that powers image generators like Stable Diffusion — could soon dominate most AI workloads. Musk cited Stanford professor Stefano Ermon, whose research argues that diffusion models’ inherent parallelism gives them a decisive advantage over the sequential, autoregressive transformers that currently power GPT-4, Claude, and Gemini.
While transformers have defined the past five years of AI, Musk’s comment hints at an impending architectural shift — one reminiscent of the deep learning revolutions that came before it.
Meet Inception Labs and Mercury
That shift is being engineered by Inception Labs, a startup founded by Stanford professors including Ermon himself. Their flagship system, Mercury, is the world’s first diffusion-based large language model (dLLM) designed for commercial-scale text generation.
The company recently raised $50 million to scale this approach, claiming Mercury achieves up to 10× faster inference than comparable transformer models by eliminating the sequential decoding bottleneck. The vision: apply diffusion not just to pixels, but to language, video, and world modeling.
How Mercury Works
Traditional LLMs — whether GPT-4 or Claude — predict the next token one at a time, in sequence. Mercury instead starts with noise and refines it toward coherent text in parallel, using a denoising process adapted from image diffusion.
This process unfolds in two stages:
- Forward Process: Mercury gradually corrupts real text into noise across multiple steps, learning the statistical structure of language.
- Reverse Process: During inference, it starts from noise and iteratively denoises, producing complete sequences — multiple tokens at once.
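The two stages can be sketched with the masking ("absorbing-state") variant of discrete diffusion, a common formulation for text. Mercury's actual noising schedule and parameterization are not public, so the `MASK` token, the linear corruption rate, and the `predict_fn` interface below are all illustrative:

```python
import random

MASK = "<mask>"  # absorbing state standing in for "noise" on discrete tokens

def forward_corrupt(tokens, t, num_steps):
    """Forward process: mask each token with probability t / num_steps,
    so t = 0 leaves the text intact and t = num_steps destroys it all."""
    rate = t / num_steps
    return [MASK if random.random() < rate else tok for tok in tokens]

def reverse_denoise(masked, predict_fn, num_steps):
    """Reverse process: repeatedly ask the model (predict_fn) for a token
    at every masked position at once, filling them in parallel."""
    seq = list(masked)
    for _ in range(num_steps):
        proposals = predict_fn(seq)  # one parallel prediction per position
        seq = [proposals[i] if tok == MASK else tok
               for i, tok in enumerate(seq)]
    return seq
```

A real dLLM would commit only a subset of positions per step and could re-mask low-confidence tokens; the loop above collapses that schedule to its simplest form.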
By replacing next-token prediction with a diffusion denoising objective, Mercury gains parallelism, error correction, and remarkable speed. Despite this radical shift, it retains transformer backbones for compatibility with existing training and inference pipelines (SFT, RLHF, DPO, etc.).
Inside the Diffusion Revolution
Mercury’s text diffusion process operates on discrete token sequences x \in X. Each diffusion step samples and refines latent variables z_t that move from pure noise toward meaningful text representations. The training objective minimizes a weighted denoising loss:
L(x) = -\mathbb{E}_{t}\left[\gamma(t)\,\mathbb{E}_{z_t \sim q}\log p_\theta(x \mid z_t)\right]
In practice, this means Mercury can correct itself mid-generation — something autoregressive transformers fundamentally struggle with. The result is a coarse-to-fine decoding loop that predicts multiple tokens simultaneously, improving both efficiency and coherence.
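A minimal sketch of such a coarse-to-fine loop, assuming a model that returns a token and a confidence score for every position in parallel — the commit schedule and the `model` interface are invented for illustration, not Mercury's published decoder:

```python
def coarse_to_fine_decode(model, length, steps):
    """Toy coarse-to-fine decoder: every pass predicts all positions in
    parallel, then commits only the most confident tokens so far; the
    rest stay "noisy" (None) and are re-predicted on the next pass.
    Because even committed tokens are re-predicted from scratch each
    pass, earlier mistakes can be revised mid-generation."""
    seq = [None] * length                 # pure noise: nothing committed
    for step in range(steps):
        tokens, confidences = model(seq)  # one parallel forward pass
        keep = max(1, length * (step + 1) // steps)
        order = sorted(range(length), key=lambda i: -confidences[i])
        seq = [None] * length
        for i in order[:keep]:            # commit the top-`keep` positions
            seq[i] = tokens[i]
    return seq
```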
Training and Scale
Mercury is trained on trillions of tokens spanning web, code, and curated synthetic data. The models range from compact “Mini” and “Small” variants to larger generalist systems, with context windows of up to 128K tokens. Inference typically completes in 10–50 denoising steps — far fewer forward passes than token-by-token sequential generation.
Training runs on NVIDIA H100 clusters using standard LLM toolchains, with alignment handled via instruction tuning and preference optimization.
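That step budget is where the speedup comes from: autoregressive decoding needs one forward pass per emitted token, while a diffusion LM needs a roughly fixed number of denoising passes regardless of output length. A back-of-envelope comparison of pass counts (each diffusion pass touches the whole sequence, so wall-clock gains are smaller than the raw ratio):

```python
def forward_passes(num_tokens, mode, denoise_steps=30):
    """Rough count of model forward passes needed to emit num_tokens
    tokens. Ignores batching, KV caching, and per-pass cost differences."""
    if mode == "autoregressive":
        return num_tokens          # one next-token pass per token
    if mode == "diffusion":
        return denoise_steps       # fixed denoising budget, any length
    raise ValueError(f"unknown mode: {mode!r}")

# A 1,000-token completion: 1,000 passes autoregressively versus 30
# denoising passes — a ~33x reduction in pass count.
```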
Performance: 10× Faster, Same Quality
On paper, Mercury’s numbers are eye-catching:
| Benchmark | Mercury Coder Mini | Mercury Coder Small | GPT-4o Mini | Claude 3.5 Haiku |
|---|---|---|---|---|
| HumanEval (%) | 88.0 | 90.0 | ~85 | 90+ |
| MBPP (%) | 76.6 | 77.1 | ~75 | ~78 |
| Tokens/sec (H100) | 1109 | 737 | 59 | ~100 |
| Latency (ms, Copilot Arena) | 25 | N/A | ~100 | ~50 |
Mercury rivals or surpasses transformer baselines on code and reasoning tasks, while generating 5–20× faster on equivalent hardware. Its performance on Fill-in-the-Middle (FIM) benchmarks also suggests diffusion’s potential for robust, parallel context editing — a key advantage for agents, copilots, and IDE integrations.
A Historical Echo
Machine learning has cycled through dominant architectures roughly every decade:
- late 2000s–early 2010s: Convolutional Neural Networks (CNNs)
- mid-2010s: Recurrent Neural Networks (RNNs)
- 2020s: Transformers
Each leap offered not just better accuracy, but better compute scaling. Diffusion may be the next inflection point — especially as GPUs, TPUs, and NPUs evolve for parallel workloads.
Skeptics, however, note that language generation’s discrete structure may resist full diffusion dominance. Transformers enjoy massive tooling, dataset, and framework support. Replacing them wholesale won’t happen overnight. But if diffusion proves cheaper, faster, and scalable, its trajectory may mirror the very transformers it now challenges.
What Comes Next
Inception Labs has begun opening up the Mercury API at platform.inceptionlabs.ai, priced at $0.25 per million input tokens and $1.00 per million output tokens — a clear signal they’re aiming at OpenAI-level production workloads. The Mercury Coder Playground is live for testing, and a generalist chat model is now in closed beta.
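At those rates, per-request cost is easy to estimate. The helper below just encodes the published prices; the function name is ours, not part of any official SDK:

```python
def mercury_cost_usd(input_tokens, output_tokens):
    """Estimated request cost at the published Mercury API rates:
    $0.25 per million input tokens, $1.00 per million output tokens."""
    return input_tokens / 1e6 * 0.25 + output_tokens / 1e6 * 1.00

# A 10,000-token prompt with a 2,000-token reply costs about $0.0045.
```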
If Musk and Ermon are right, diffusion could define the next chapter of AI — one where text, video, and world models share the same generative backbone. And if Mercury’s numbers hold, that chapter may arrive sooner than anyone expects.
Further Reading
- Stefano Ermon et al., Diffusion Language Models Are Parallel Transformers (Stanford AI Lab)
- Elon Musk on X, Diffusion Will Likely Dominate Future AI Workloads
- Inception Labs, Mercury Technical Overview (2025)

