Elon Musk, Diffusion Models, and the Rise of Mercury

  1. A New Paradigm May Be Forming
  2. Meet Inception Labs and Mercury
  3. How Mercury Works
  4. Inside the Diffusion Revolution
  5. Training and Scale
  6. Performance: 10× Faster, Same Quality
  7. A Historical Echo
  8. What Comes Next
  9. Further Reading

A New Paradigm May Be Forming

In a recent exchange on X, Elon Musk echoed a striking prediction: diffusion models — the same generative approach that powers image generators like Stable Diffusion — could soon dominate most AI workloads. Musk cited Stanford professor Stefano Ermon, whose research argues that diffusion models’ inherent parallelism gives them a decisive advantage over the sequential, autoregressive transformers that currently power GPT-4, Claude, and Gemini.

While transformers have defined the past five years of AI, Musk’s comment hints at an impending architectural shift — one reminiscent of the deep learning revolutions that came before it.


Meet Inception Labs and Mercury

That shift is being engineered by Inception Labs, a startup founded by Stanford professors including Ermon himself. Their flagship system, Mercury, is the world’s first diffusion-based large language model (dLLM) designed for commercial-scale text generation.

The company recently raised $50 million to scale this approach, claiming Mercury achieves up to 10× faster inference than comparable transformer models by eliminating sequential bottlenecks. The vision: apply diffusion not just to pixels, but to language, video, and world modeling.


How Mercury Works

Traditional LLMs — whether GPT-4 or Claude — predict the next token one at a time, in sequence. Mercury instead starts with noise and refines it toward coherent text in parallel, using a denoising process adapted from image diffusion.

This process unfolds in two stages:

  1. Forward Process: Mercury gradually corrupts real text into noise across multiple steps, learning the statistical structure of language.
  2. Reverse Process: During inference, it starts from noise and iteratively denoises, producing complete sequences — multiple tokens at once.

By replacing next-token prediction with a diffusion denoising objective, Mercury gains parallelism, error correction, and remarkable speed. Despite this radical shift, it retains transformer backbones for compatibility with existing training and inference pipelines (SFT, RLHF, DPO, etc.).
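Inception Labs has not published Mercury’s training recipe, but a common way to realize a text-diffusion objective is token masking: corrupt a random fraction of tokens into a noise symbol, then train the backbone to recover all of them in parallel. A minimal PyTorch sketch, where the denoiser backbone, the mask token, and the weighting schedule are all illustrative assumptions rather than Mercury’s actual method:

    import torch
    import torch.nn.functional as F

    def diffusion_lm_loss(denoiser, x, mask_id):
        """Toy masked-diffusion training step (not Mercury's actual recipe).
        Forward process: corrupt a t-sized share of tokens into a mask/noise symbol.
        Reverse objective: recover the originals from the corrupted sequence, in parallel.
        """
        # Sample a corruption level t in (0, 1]; gamma(t) up-weights noisier steps.
        t = torch.rand(()) * 0.99 + 0.01
        gamma_t = 1.0 / t  # illustrative weighting, not Mercury's schedule

        # Forward process q(z_t | x): independently mask each token with probability t.
        corrupt = torch.rand_like(x, dtype=torch.float) < t
        z_t = torch.where(corrupt, torch.full_like(x, mask_id), x)

        # Reverse model p_theta(x | z_t): score every original token in one parallel pass.
        logits = denoiser(z_t)  # assumed shape: [seq_len, vocab_size]

        # Weighted denoising loss over the corrupted positions.
        return gamma_t * F.cross_entropy(logits[corrupt], x[corrupt])

The key contrast with next-token prediction is that supervision lands on every corrupted position simultaneously rather than on a single position per step.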


Inside the Diffusion Revolution

Mercury’s text diffusion process operates on discrete token sequences x \in X. Each diffusion step samples and refines latent variables z_t that move from pure noise toward meaningful text representations. The training objective minimizes a weighted denoising loss:

L(x) = -\mathbb{E}_{t}\left[\gamma(t) \cdot \mathbb{E}_{z_t \sim q} \log p_\theta(x \mid z_t)\right]

In practice, this means Mercury can correct itself mid-generation — something autoregressive transformers fundamentally struggle with. The result is a coarse-to-fine decoding loop that predicts multiple tokens simultaneously, improving both efficiency and coherence.
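The inference side can be sketched the same way. Everything below (the denoiser call, the mask token, the confidence-based commit schedule) is an assumption for illustration, not Mercury’s published decoder:

    import torch

    def diffusion_decode(denoiser, prompt_ids, gen_len=64, num_steps=10, mask_id=0):
        """Toy coarse-to-fine decoder: start from an all-noise canvas and commit
        the most confident token predictions a chunk at a time."""
        canvas = torch.full((gen_len,), mask_id, dtype=torch.long, device=prompt_ids.device)

        for step in range(num_steps):
            # The backbone scores every position of prompt + canvas in one parallel pass.
            logits = denoiser(torch.cat([prompt_ids, canvas]))[len(prompt_ids):]
            conf, pred = logits.softmax(-1).max(-1)

            # Only positions still holding noise are candidates to commit this step.
            still_noisy = canvas == mask_id
            conf = conf.masked_fill(~still_noisy, -1.0)

            # Commit the top-k most confident predictions; later passes refine the rest.
            k = max(1, int(still_noisy.sum()) // (num_steps - step))
            idx = conf.topk(k).indices
            canvas[idx] = pred[idx]

        return canvas

Because each denoising pass conditions on the full canvas, later commitments take earlier ones into account; richer schedules also re-mask low-confidence tokens, which is where the self-correction described above comes from.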


Training and Scale

Mercury is trained on trillions of tokens spanning web, code, and curated synthetic data. The models range from compact “Mini” and “Small” versions up to large generalist systems with context windows up to 128K tokens. Inference typically completes in 10–50 denoising steps, far fewer forward passes than generating a long sequence token by token.

Training runs on NVIDIA H100 clusters using standard LLM toolchains, with alignment handled via instruction tuning and preference optimization.


Performance: 10× Faster, Same Quality

On paper, Mercury’s numbers are eye-catching:

Benchmark                    | Mercury Coder Mini | Mercury Coder Small | GPT-4o Mini | Claude 3.5 Haiku
HumanEval (%)                | 88.0               | 90.0                | ~85         | 90+
MBPP (%)                     | 76.6               | 77.1                | ~75         | ~78
Tokens/sec (H100)            | 1109               | 737                 | 59          | ~100
Latency (ms, Copilot Arena)  | 25                 | N/A                 | ~100        | ~50

Mercury rivals or surpasses transformer baselines on code and reasoning tasks, while generating 5–20× faster on equivalent hardware. Its performance on Fill-in-the-Middle (FIM) benchmarks also suggests diffusion’s potential for robust, parallel context editing — a key advantage for agents, copilots, and IDE integrations.


A Historical Echo

Machine learning has cycled through dominant architectures roughly every decade:

  • 2000s: Convolutional Neural Networks (CNNs)
  • 2010s: Recurrent Neural Networks (RNNs)
  • 2020s: Transformers

Each leap offered not just better accuracy, but better compute scaling. Diffusion may be the next inflection point — especially as GPUs, TPUs, and NPUs evolve for parallel workloads.

Skeptics, however, note that language generation’s discrete structure may resist full diffusion dominance. Transformers enjoy massive tooling, dataset, and framework support. Replacing them wholesale won’t happen overnight. But if diffusion proves cheaper, faster, and scalable, its trajectory may mirror the very transformers it now challenges.


What Comes Next

Inception Labs has begun opening Mercury APIs at platform.inceptionlabs.ai, pricing at $0.25 per million input tokens and $1.00 per million output tokens — a clear signal they’re aiming at OpenAI-level production workloads. The Mercury Coder Playground is live for testing, and a generalist chat model is now in closed beta.
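At those rates, a back-of-the-envelope cost model is easy to run; the request sizes below are assumptions, not Inception Labs figures:

    # Hypothetical cost estimate at the listed Mercury API rates.
    INPUT_RATE = 0.25 / 1_000_000   # dollars per input token
    OUTPUT_RATE = 1.00 / 1_000_000  # dollars per output token

    def monthly_cost(requests, in_tokens=2_000, out_tokens=500):
        """Rough monthly spend for a copilot-style workload (sizes are assumptions)."""
        return requests * (in_tokens * INPUT_RATE + out_tokens * OUTPUT_RATE)

    print(f"${monthly_cost(1_000_000):,.0f}")  # 1M requests/month -> about $1,000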

If Musk and Ermon are right, diffusion could define the next chapter of AI — one where text, video, and world models share the same generative backbone. And if Mercury’s numbers hold, that chapter may arrive sooner than anyone expects.


Further Reading

  • Stefano Ermon et al., Diffusion Language Models Are Parallel Transformers (Stanford AI Lab)
  • Elon Musk on X, Diffusion Will Likely Dominate Future AI Workloads
  • Inception Labs, Mercury Technical Overview (2025)

Rise of AI Development Environments

The rise of Cursor, Copilot + VSCode, Replit, and Qwen2.5, among others, has caused me to rethink my ways. Focus will still be key in discerning what to build.


AI development environments change the global technology conversation. They also influence the pace of hiring and team augmentation decisions.

Qwen2.5-Coder Open Source

Alibaba Group has released the Qwen2.5-Coder open-source model. Qwen2.5-Coder-32B-Instruct is currently the best-performing open-source code model (SOTA), matching the coding capabilities of GPT-4o. Qwen2.5-Coder offers six different model sizes: 0.5B, 1.5B, 3B, 7B, 14B, and 32B.

Project: https://qwenlm.github.io/blog/qwen2.5-coder-family/

Each size provides both Base and Instruct models. The Instruct model engages in direct dialogue. The Base model serves as a foundational model for developers to fine-tune.

Github: https://github.com/QwenLM/Qwen2.5-Coder

Huggingface: https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

Additionally, two demo scenarios, Code Assistants and Artifacts, are available for exploration.

Code Assistants: https://huggingface.co/spaces/Qwen/Qwen2.5-Coder-demo
Artifacts: https://huggingface.co/spaces/Qwen/Qwen2.5-Coder-Artifacts
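
For local experimentation, the checkpoints load with the standard Hugging Face transformers chat workflow. A minimal sketch, assuming the Qwen/Qwen2.5-Coder-7B-Instruct checkpoint and enough GPU memory to host it:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    messages = [
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate and strip the prompt tokens from the decoded output.
    output = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))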

Emergent abilities in LLMs

Are Emergent Abilities of Large Language Models a Mirage?
Authored by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo
Computer Science, Stanford University

https://arxiv.org/pdf/2304.15004.pdf

This work challenges the notion of emergent abilities in large language models, suggesting that these abilities are not inherent to the model’s scale but rather a result of the choice of metrics used in research. Emergent abilities are defined as new capabilities that appear abruptly and unpredictably as the model scales up. The authors propose that when a specific task and model family are analyzed with fixed model outputs, the appearance of emergent abilities is influenced by the type of metric chosen: nonlinear or discontinuous metrics tend to show emergent abilities, whereas linear or continuous metrics show smooth, predictable changes in performance.

To support this hypothesis, the authors present a simple mathematical model and conduct three types of analyses:

  1. Examining the effect of metric choice on the InstructGPT/GPT-3 family in tasks where emergent abilities were previously claimed.
  2. Performing a meta-analysis on the BIG-Bench project to test predictions about metric choices in relation to emergent abilities.
  3. Demonstrating how metric selection can create the illusion of emergent abilities in various vision tasks across different deep networks.

Their findings suggest that what has been perceived as emergent abilities could be an artifact of certain metrics or insufficient statistical analysis, implying that these abilities might not be a fundamental aspect of scaling AI models.
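
The paper’s core argument can be reproduced in a few lines: hold per-token accuracy to a smooth, predictable improvement with scale, then compare a continuous metric (per-token accuracy) with a discontinuous one (exact match over an L-token answer, which succeeds only if every token is right). The scaling curve and answer length below are made-up illustrative numbers:

    import numpy as np

    scales = np.logspace(0, 4, 10)       # hypothetical model "sizes"
    p = 1 - 0.5 * scales ** -0.3         # smooth power-law gain in per-token accuracy
    L = 10                               # answer length in tokens

    per_token = p                        # continuous metric: improves gradually
    exact_match = p ** L                 # nonlinear metric: looks like a sudden jump

    for s, a, e in zip(scales, per_token, exact_match):
        print(f"scale={s:10.1f}   per-token={a:.3f}   exact-match={e:.3f}")

The per-token curve rises steadily from 0.5 toward 1.0, while exact match sits near zero at small scales and then climbs sharply, which is exactly the shape that gets labeled “emergent.”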

Emergent abilities of large language models are created by the researcher’s chosen metrics, not unpredictable changes in model behavior with scale.

The term “emergent abilities of LLMs” was recently and crisply defined as “abilities that are not present in smaller-scale models but are present in large-scale models; thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models”. Such emergent abilities were first discovered in the GPT-3 family. Subsequent work emphasized the discovery, writing that “[although model] performance is predictable at a general level, performance on a specific task can sometimes emerge quite unpredictably and abruptly at scale”.


How can mixed reality drive more engagement in movement and fitness?

Fitness is one of the most robust categories under discussion across Augmented Reality and Virtual Reality devices. For whom does this level of movement merit the moniker “fitness”? And on what timeline will we see sweeping adoption of fitness via spatial computing (the term now widely known thanks to Apple’s Vision Pro announcement, which gathers VR / AR / MR under a single umbrella)?

I’m seeing new unlocks, particularly around device comfort, the spatial awareness afforded by camera passthrough, and greater respect for ergonomic polish among developers.

The video seen here is a clip taken November 8th, 2023, showing a first-person view of a Quest 3 experience that allows for gestures, hand tracking, and movement to be used as input to an increasing number of games.

The title is built by YUR; the app is called YUR World.

Reblog: Refining Images using Convolutional Neural Networks (CNN) as Proxies by Jasmine L. Collins

For my Machine Learning class project, I decided to look at whether or not we can refine images by using techniques for visualizing what a Convolutional Neural Network (CNN) trained on an image recognition task has learned.
