Elon Musk, Diffusion Models, and the Rise of Mercury

  1. A New Paradigm May Be Forming
  2. Meet Inception Labs and Mercury
  3. How Mercury Works
  4. Inside the Diffusion Revolution
  5. Training and Scale
  6. Performance: 10× Faster, Same Quality
  7. A Historical Echo
  8. What Comes Next
  9. Further Reading

A New Paradigm May Be Forming

In a recent exchange on X, Elon Musk echoed a striking prediction: diffusion models — the same architecture that powers image generators like Stable Diffusion — could soon dominate most AI workloads. Musk cited Stanford professor Stefano Ermon, whose research argues that diffusion models’ inherent parallelism gives them a decisive advantage over the sequential, autoregressive transformers that currently power GPT-4, Claude, and Gemini.

While transformers have defined the past five years of AI, Musk’s comment hints at an impending architectural shift — one reminiscent of the deep learning revolutions that came before it.


Meet Inception Labs and Mercury

That shift is being engineered by Inception Labs, a startup founded by Stanford professors including Ermon himself. Their flagship system, Mercury, is the world’s first diffusion-based large language model (dLLM) designed for commercial-scale text generation.

The company recently raised $50 million to scale this approach, claiming Mercury achieves up to 10× faster inference than comparable transformer models by eliminating sequential bottlenecks. The vision: extend diffusion beyond pixels to language, video, and world modeling.


How Mercury Works

Traditional LLMs — whether GPT-4 or Claude — predict the next token one at a time, in sequence. Mercury instead starts with noise and refines it toward coherent text in parallel, using a denoising process adapted from image diffusion.

This process unfolds in two stages:

  1. Forward Process: During training, real text is gradually corrupted into noise over multiple steps; learning to undo that corruption teaches the model the statistical structure of language.
  2. Reverse Process: During inference, it starts from noise and iteratively denoises, producing complete sequences — multiple tokens at once.

By replacing next-token prediction with a diffusion denoising objective, Mercury gains parallelism, error correction, and remarkable speed. Despite this radical shift, it retains transformer backbones for compatibility with existing training and inference pipelines (SFT, RLHF, DPO, etc.).
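Inception Labs has not published Mercury's exact decoding algorithm, but the parallel, coarse-to-fine idea can be sketched as a generic masked-diffusion loop. Everything below (the model interface, the mask token, the confidence-based commit rule, the step count) is an illustrative assumption, not Mercury's implementation:

import torch

def diffusion_decode(model, prompt_ids, seq_len=64, num_steps=20, mask_id=0):
    # Start from pure "noise": every generated position holds the mask token.
    x = torch.full((1, seq_len), mask_id)
    for step in range(num_steps):
        # One parallel forward pass over prompt + current draft; every position
        # gets a prediction at once, unlike next-token decoding.
        logits = model(torch.cat([prompt_ids, x], dim=1))[:, prompt_ids.shape[1]:]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        still_masked = x == mask_id
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break
        # Commit the most confident predictions among still-masked positions,
        # spreading the remaining work evenly over the remaining steps.
        n_commit = max(1, n_masked // (num_steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        commit = conf.topk(n_commit, dim=-1).indices
        x.scatter_(1, commit, pred.gather(1, commit))
    return x

The key contrast with autoregressive decoding is that the number of model calls is fixed by num_steps rather than by sequence length, which is where the claimed speedups come from.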


Inside the Diffusion Revolution

Mercury’s text diffusion process operates on discrete token sequences x \in X. Each diffusion step samples and refines latent variables z_t that move from pure noise toward meaningful text representations. The training objective minimizes a weighted denoising loss:

L(x) = -\mathbb{E}_{t}\big[\gamma(t)\,\mathbb{E}_{z_t \sim q}\log p_\theta(x \mid z_t)\big]

In practice, this means Mercury can correct itself mid-generation — something autoregressive transformers fundamentally struggle with. The result is a coarse-to-fine decoding loop that predicts multiple tokens simultaneously, improving both efficiency and coherence.
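In code, one Monte Carlo estimate of that objective might look like the sketch below; it assumes a simple masked-token corruption process for q and an illustrative weighting γ(t), since Mercury's actual noise schedule and weighting are not public.

import torch
import torch.nn.functional as F

def diffusion_lm_loss(model, x, mask_id, gamma=lambda t: 1.0 / t):
    # One Monte Carlo sample of L(x) = -E_t[ gamma(t) * E_{z_t ~ q} log p_theta(x | z_t) ],
    # assuming q masks each token of x independently with probability t.
    batch, seq_len = x.shape
    t = torch.rand(batch, 1).clamp(min=1e-3)                    # noise level t ~ U(0, 1]
    corrupt = torch.rand(batch, seq_len) < t                    # which positions get masked
    z_t = torch.where(corrupt, torch.full_like(x, mask_id), x)  # z_t ~ q(z_t | x)
    logits = model(z_t)                                         # predict all positions in parallel
    nll = F.cross_entropy(logits.transpose(1, 2), x, reduction="none")  # -log p_theta(x | z_t)
    return (gamma(t) * nll).mean()                              # weighted denoising loss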


Training and Scale

Mercury is trained on trillions of tokens spanning web, code, and curated synthetic data. The models range from compact “Mini” and “Small” versions up to large generalist systems with context windows up to 128K tokens. Inference typically completes in 10–50 denoising steps, far fewer forward passes than generating a long sequence token by token.

Training runs on NVIDIA H100 clusters using standard LLM toolchains, with alignment handled via instruction tuning and preference optimization.


Performance: 10× Faster, Same Quality

On paper, Mercury’s numbers are eye-catching:

Benchmark                    | Mercury Coder Mini | Mercury Coder Small | GPT-4o Mini | Claude 3.5 Haiku
HumanEval (%)                | 88.0               | 90.0                | ~85         | 90+
MBPP (%)                     | 76.6               | 77.1                | ~75         | ~78
Tokens/sec (H100)            | 1109               | 737                 | 59          | ~100
Latency (ms, Copilot Arena)  | 25                 | N/A                 | ~100        | ~50

Mercury rivals or surpasses transformer baselines on code and reasoning tasks, while generating 5–20× faster on equivalent hardware. Its performance on Fill-in-the-Middle (FIM) benchmarks also suggests diffusion’s potential for robust, parallel context editing — a key advantage for agents, copilots, and IDE integrations.
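That FIM behavior falls out of the same machinery: nothing forces the masked positions to sit at the end of the sequence. A hypothetical variant of the decoding sketch above, conditioning on both a prefix and a suffix, might look like this (again an illustration, not Mercury's API):

import torch

def fill_in_middle(model, prefix_ids, suffix_ids, hole_len=32, num_steps=10, mask_id=0):
    # Denoise only the masked "hole" between prefix and suffix, so every step
    # is conditioned on context from both sides.
    hole = torch.full((1, hole_len), mask_id)
    start = prefix_ids.shape[1]
    for step in range(num_steps):
        seq = torch.cat([prefix_ids, hole, suffix_ids], dim=1)
        logits = model(seq)[:, start:start + hole_len]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        still_masked = hole == mask_id
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break
        n_commit = max(1, n_masked // (num_steps - step))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        commit = conf.topk(n_commit, dim=-1).indices
        hole.scatter_(1, commit, pred.gather(1, commit))
    return hole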


A Historical Echo

Machine learning has cycled through dominant architectures roughly every decade:

  • 2000s: Convolutional Neural Networks (CNNs)
  • 2010s: Recurrent Neural Networks (RNNs)
  • 2020s: Transformers

Each leap offered not just better accuracy, but better compute scaling. Diffusion may be the next inflection point — especially as GPUs, TPUs, and NPUs evolve for parallel workloads.

Skeptics, however, note that language generation’s discrete structure may resist full diffusion dominance. Transformers enjoy massive tooling, dataset, and framework support. Replacing them wholesale won’t happen overnight. But if diffusion proves cheaper, faster, and scalable, its trajectory may mirror the very transformers it now challenges.


What Comes Next

Inception Labs has begun opening Mercury APIs at platform.inceptionlabs.ai, priced at $0.25 per million input tokens and $1.00 per million output tokens — a clear signal they’re aiming at OpenAI-level production workloads. The Mercury Coder Playground is live for testing, and a generalist chat model is now in closed beta.
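For a sense of scale, here is a quick back-of-the-envelope calculation at those list prices (the workload numbers below are hypothetical):

# Published Mercury pricing: $0.25 per 1M input tokens, $1.00 per 1M output tokens.
INPUT_PRICE = 0.25 / 1_000_000    # USD per input token
OUTPUT_PRICE = 1.00 / 1_000_000   # USD per output token

def monthly_cost(requests, input_tokens, output_tokens):
    # Hypothetical workload: total spend = requests * (input cost + output cost per request)
    return requests * (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)

# e.g. 1M requests/month, 2,000-token prompts, 500-token completions:
print(f"${monthly_cost(1_000_000, 2_000, 500):,.2f}")  # -> $1,000.00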

If Musk and Ermon are right, diffusion could define the next chapter of AI — one where text, video, and world models share the same generative backbone. And if Mercury’s numbers hold, that chapter may arrive sooner than anyone expects.


Further Reading

  • Stefano Ermon et al., Diffusion Language Models Are Parallel Transformers (Stanford AI Lab)
  • Elon Musk on X, Diffusion Will Likely Dominate Future AI Workloads
  • Inception Labs, Mercury Technical Overview (2025)

Useful Resources for AI

Newsletters/blogs:
– TLDR AI (https://tldr.tech/ai) – Andrew Tan
– Ben’s Bites (https://lnkd.in/gNY8Dmme)
– The Information (paid subscription required) (https://lnkd.in/gbkaFbvf)
– Last week in AI (https://lastweekin.ai/)
– Eric Newcomer (https://www.newcomer.co/)

Podcasts:
– No priors with Sarah Guo + Elad Gil (https://lnkd.in/g7Wmr6XT)
– All-in podcast – not AI specific but they talk a lot about it (https://lnkd.in/gH35UeUy)
– Lex Fridman (https://lnkd.in/gjw7zsWX)

Online courses:
DeepLearning.ai by Andrew Ng – https://lnkd.in/gWcn5UTK

Institutional VC writing: 
– Sequoia (https://lnkd.in/g-cKpn8Y)
– A16z (https://lnkd.in/g6JxqwZA)
– Lightspeed (https://lnkd.in/gczzdEcd)
– Bessemer (https://www.bvp.com/ai)
– Radical Ventures (https://lnkd.in/guCe5Mnt); Rob Toews (https://lnkd.in/ggH8HfT8) and Ryan Shannon (https://lnkd.in/gRrBzePx)
– Madrona (https://lnkd.in/gy5D8yNG)

Industry Conferences:
– Databricks Data + AI Summit (https://lnkd.in/gF5QyXYv)
– Snowflake (https://lnkd.in/gavqzw65)
– Salesforce Dreamforce (https://lnkd.in/gJk4r58N)

Academic Conferences:
– NeurIPS (https://neurips.cc/)
– CVPR (https://cvpr.thecvf.com/)
– ICML (https://icml.cc/)
– ICLR (https://iclr.cc/)

Books:
– Genius Makers, by Cade Metz (https://lnkd.in/gr_78MB9)
– A Brief History of Intelligence, by Max Bennett (https://lnkd.in/g2uCrPzS)
– The Worlds I See, by Fei-Fei Li (https://lnkd.in/gY8Qsvis)
– Chip War, by Chris Miller (https://lnkd.in/g6ZAZSCG)

The original author of this post was Kelvin Mu on LinkedIn.

Unification of Meta Platforms: Exploring Account Center’s Role

Read time: 2–3 minutes
  1. Context: Meta’s diverse set of platforms has a singular center for account management, called Account Center.
  2. Why does this matter?: Areas I mean to explore further
  3. Taxonomy of Account Center
  4. Conclusion

Meta offers a unified account management center for Meta Horizons, Instagram, and Facebook.

Meta Account Center

Context:

Meta’s diverse set of platforms has a singular center for account management, called Account Center.

A unified account and identity system is crucial. We all increasingly sync data from one application to another.

Still, I have questions about how identity, security, single sign-on, and seamless connected experiences will be handled in the future. I will chart these questions across multiple posts.

Account Center is focused on accounts for Instagram, Facebook (Big Blue), and Meta Horizon at the moment. Given the metaverse Meta is building towards, it is in the company’s interest to set a high bar for how this unification of platforms will work.

Why does this matter?: Areas I mean to explore further

1. Account Center covers platforms like IG, Horizon, and Facebook. But why aren’t other Meta-owned platforms (e.g. WhatsApp) there yet? When, and in what form, will Meta expand this into a standard the rest of us can tap into?

2. Potential challenges and benefits of identity management in the metaverse and where turn-key solutions can exist.

3. Real-life scenarios and/or case studies to illustrate the impact of the Account Center on user experience.

Taxonomy of Account Center

The Account Center delivers its services through several key components:

Profiles: Centralized management of user profiles across different platforms.

Connected Experiences: Seamless integration and data synchronization between applications.

Password & Security: Enhanced security measures and password management.

Personal Details: Management of personal information.

Your Information and Permissions: Control over data sharing and permissions.

Ad Preferences: Customization of ad settings and preferences.

Meta Pay: Unified payment system across Meta’s platforms.

Meta Verified: Verification service for enhanced account credibility.

Accounts: Overall management of linked accounts.

Click here for the Meta Accounts Center Taxonomy (Rolling Updates)

Conclusion

Meta’s Account Center and related solutions are a significant step towards unifying the user experience, and they position Meta to set a high standard for account management in the emerging metaverse, much as Zuckerberg’s public evangelism of open-source models set a tone elsewhere. As its scope expands, Account Center can become a reference point for other identity and unification centers.

These future posts will delve into the specifics of these areas. They will provide a comprehensive analysis of the Account Center’s role and potential in the evolving digital landscape.

Please let me know if there are specific details or topics you would like to explore further.

Traversal of Immersive Environments | HoloTile Floor from Disney

If you’re new to The Latent Element, I write about future market development narratives or things of interest to me, hence the name “latent” element. These views are not representative of any company nor do they contain privileged info.

More details and contact info are in the about section.

Post Details

Read time: 3–4 minutes

Table of Contents:

  1. The Challenge
  2. Early Solutions
  3. Disney Research HoloTile Floor
  4. Closing Thoughts
  5. Sources


Emergent abilities in LLMs

Are Emergent Abilities of Large Language Models a Mirage?
Authored by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo
Computer Science, Stanford University

https://arxiv.org/pdf/2304.15004.pdf

This work challenges the notion of emergent abilities in large language models, suggesting that these abilities are not inherent to the model’s scale but rather a result of the choice of metrics used in research. Emergent abilities are defined as new capabilities that appear abruptly and unpredictably as the model scales up. The authors propose that when a specific task and model family are analyzed with fixed model outputs, the appearance of emergent abilities is influenced by the type of metric chosen: nonlinear or discontinuous metrics tend to show emergent abilities, whereas linear or continuous metrics show smooth, predictable changes in performance.

To support this hypothesis, the authors present a simple mathematical model and conduct three types of analyses:

  1. Examining the effect of metric choice on the InstructGPT/GPT-3 family in tasks where emergent abilities were previously claimed.
  2. Performing a meta-analysis on the BIG-Bench project to test predictions about metric choices in relation to emergent abilities.
  3. Demonstrating how metric selection can create the illusion of emergent abilities in various vision tasks across different deep networks.

Their findings suggest that what has been perceived as emergent abilities could be an artifact of certain metrics or insufficient statistical analysis, implying that these abilities might not be a fundamental aspect of scaling AI models.

Emergent abilities of large language models are created by the researcher’s chosen metrics, not unpredictable changes in model behavior with scale.
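A toy simulation (my own illustration, not code from the paper) makes the argument concrete: per-token accuracy is assumed to improve smoothly with parameter count, yet an exact-match metric that requires ten consecutive tokens to be correct appears to "switch on" abruptly at large scale.

import numpy as np

# Toy model: per-token accuracy improves smoothly with parameter count.
params = np.logspace(7, 11, 50)                    # 10M .. 100B parameters
per_token_acc = np.exp(-(1e9 / params) ** 0.5)     # smooth, saturating curve (illustrative)

k = 10                                             # task scored by getting k tokens exactly right
exact_match = per_token_acc ** k                   # looks "emergent": near zero, then a sharp rise
per_token = per_token_acc                          # continuous metric: improves gradually

for p, em, tok in zip(params[::7], exact_match[::7], per_token[::7]):
    print(f"{p:10.2e} params   exact-match={em:.3f}   per-token={tok:.3f}")

Under a discontinuous metric the same smooth underlying improvement reads as a sudden capability jump, which is exactly the artifact the authors describe.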

The term “emergent abilities of LLMs” was recently and crisply defined as “abilities that are not present in smaller-scale models but are present in large-scale models; thus they cannot be predicted by simply extrapolating the performance improvements on smaller-scale models”. Such emergent abilities were first discovered in the GPT-3 family. Subsequent work emphasized the discovery, writing that “[although model] performance is predictable at a general level, performance on a specific task can sometimes emerge quite unpredictably and abruptly at scale”.

https://arxiv.org/pdf/2304.15004.pdf

How can mixed reality drive more engagement in movement and fitness?

Fitness is one of the most robust categories under discussion across augmented reality and virtual reality devices. For whom does this level of movement merit the moniker “fitness”? And what timeline are we working with before we see sweeping adoption of fitness via spatial computing (the term now widely known thanks to Apple’s Vision Pro announcement, which folds VR / AR / MR together)?

I’m seeing new unlocks, particularly around device comfort, the spatial awareness afforded by camera passthrough, and greater attention to ergonomic polish among developers.

The video seen here is a clip taken November 8th, 2023, showing a first-person view of a Quest 3 experience that allows for gestures, hand tracking, and movement to be used as input to an increasing number of games.

The title is built by YUR; the app is YUR World.

Developer Blog Post: ARKit #1

When developing AR applications for Apple phones, there are two cameras we speak about. One is the physical camera on the back of the phone. The other is the virtual camera in your Unity scene, which in turn matches the position and orientation of the real-world camera.

A virtual camera in Unity has a setting called Clear Flags, which determines which parts of the screen are cleared each frame. Setting your main virtual camera’s Clear Flags to “Depth Only” instructs the renderer not to draw a skybox or solid-color background, allowing virtual objects to be overlaid seamlessly on the physical camera feed.

More to come on differences between hit testing and ray casting in the context of ARKit and a broader look at intersection testing approaches in the next post.

Update: Oculus Launch Pad 2017 – Future of Farming

Everything in the brain is hierarchical. My hypothesis is that, in VR design, this can be really helpful for setting context. For example, at Virtually Live the VR content was “Formula E Season 2 Highlights”, meaning the person donning the headset could watch races. I once proposed that we borrow the amazing UX of Realities.io and use an interactive model of Earth as the highest level of abstraction above the races (which occur all over the world). The user could spin the globe around and find a location to load in. Written abstractly, the hierarchy in this example is: Globe is a superset of Countries, Countries of Cities, and Cities of Places. I figured this would be perfect for an electric motorsport championship series that travels to famous cities each month. In the end we went with a carousel design that was more expeditious than the globe.

The Future of Farming takes place largely in a metropolitan area, namely San Francisco. So I’ve decided that, to begin, I’ll borrow from the hierarchical plan: I want to show the user an orthographic projection of San Francisco with maybe a handful of locations highlighted as interactable. To do this I’ve set up WRLD in my project for the city landscape.

Upon selecting one of the highlighted locations with the GearVR controller, a scene will load with a focal piece of farming equipment that has made its way into that type of place (e.g. a warehouse, house, or apartment).

A quick aside: last week I had a tough travel and work schedule in New York. When I read back what I had written, the blog post was pretty bare, so I decided it was better not to share it. Another hurdle was the unfortunate loss of the teammate I announced two weeks prior, simply because he prioritized projects with budgets more appealing to him. I dwelled on this for a while, as I admired his plant modeling work a lot. With the loss of that collaborator, and weighing a few other factors, I’ve decided to pursue an art style much akin to that of Virtual Virtual Reality or Superhot: less geometry, all created in VR. Most of this will be done via Google Blocks and a workflow for pushing created environments to Unity, which is pretty straightforward. After you have created your model in Google Blocks, visit the site in a WebGL-capable browser and download your model. From there, unzip the file and drag it into a folder such as Assets > Blocks Import, which I recommend creating as a way of staying organized. You’ll note that Blocks imports usually produce a .mtl file, a Materials folder, and an .obj model. To get your intended Google Blocks materials to show through, select the .obj and change one setting called “Material Naming” to “By Base Texture Name”; “Material Search” can be set to “Recursive Up”.


Here’s a look at the artwork for a studio apartment in SF for the app, as viewed from above. It’s a publicly shared bedroom model that I’m remixing, and you can see I’ve added a triangular floor space for a kitchen; this is likely where the windowsill variety of hydroponic crop equipment will go. Modeling one such piece is going to be really fun.

 

 

View from Above

Angle View

In the past weeks, I’ve dedicated myself to learning about gardening and farming practices through readings, podcasts, and conversations with people in business ecosystems involving food product suppliers. I learned about growing shiitake mushrooms and broccoli sprouts at home and got hands-on with these. I learned about the technology evolution behind rice cookers and about relevant policy for farmers on the West Coast over the last dozen years. In the industry, there are a number of effective farming methods that I’m planning to draw on (indoor hydroponic and aeroponic) that I can see working in some capacity in the home, as well as settings I will highlight, such as a full-scale vertical indoor farm facility (https://techcrunch.com/2017/07/19/billionaires-make-it-rain-on-plenty-the-indoor-farming-startup/).

I have asked someone who works at Local Bushel for help from a design-consultant standpoint.

To expound on why Local Bushel is perhaps a helpful reference point: Local Bushel is a community of individuals dedicated to increasing our consumption of responsibly raised food. Their values align well with educating me (the creator) about the marketplace whose future I want to project. Those values are:

  1. Fostering Community
  2. Being Sustainable and Responsible
  3. Providing High Quality, Fresh Ingredients

——
For interactions, I can start simple and use info cards and scene changes driven by the orientation of the user’s head, using ray casts. I’ll work in Oculus Gear VR Controller support eventually.

Project Futures: The Future of Farming

The following is the one-to-five-paragraph proposal I submitted to Oculus Launchpad 2017. As for why you should care: I am open to suggestions on which installments to make next.

Project Futures is a virtual reality series that aims to put people right in the middle of a realized product vision. I’ll set out to make a couple of example experiences to share, rolling out over the next couple of months. The first will be about the future of farming: vertical, climate-controlled orchards that are shippable to anywhere in the world.

“His product proposes hydrant irrigation feed vertical stacks of edible crops—arugula, shiso, basil, and chard, among others—the equivalent of two acres of cultivated land inside a climate-controlled 320-square-foot shell. This is essentially an orchard accessible by families in metropolitan settings. People will need help a) envisioning how this fits into the American day b) how to actually use an orchard/garden like this”

Industrial Landscape

Since VR is such an infant technology, if you can communicate your idea and introduce your product using a more traditional method (e.g. illustration, PowerPoint, or video, as below), then you probably should.


There are, however, some ideas that are very hard to communicate using traditional methods. That’s why using VR to introduce product ideas today is appealing. Climate-controlled vertical farms that are shippable are extremely difficult for the average American to conceptualize. There is real value for the customer, who gets a learning experience, fueled by virtual interactions and immersive technology, about what it’s like to use one such orchard for grocery shopping.

Now here’s where my story starts to converge with this idea for the series. I keenly seek out constraints that will allow me to keep healthy and eat healthily. Incredibly, I’m using a service which allows local Bay Area farms to deliver groceries for the week to my door every Tuesday. I only order paleo ingredients, or rather plant-based pairings with a protein.

What I want to focus on is that, currently, this service isn’t ready to scale across the nation. I suspect there simply aren’t resources for the same crops in different places, among other logistical reasons for not scaling far beyond the Bay Area. So I thought: this delivery infrastructure obviously sits atop resources created by farmers. To scale a delivery model that can be so good for the consumer’s health, the infrastructure promise of a shippable orchard could be huge. Conditional on the climate-controlled, shippable orchard’s effectiveness, every geographic area would become an addressable market for such a delivery service.

I would like to empower people across the world to have access to healthy foods. But an important part of this process is a shift in thinking about how this healthy future might exist. VR is a medium I’ve paid close attention to for a couple of years, and before I get too far ahead of myself, I will see what I can produce with it to communicate the idea of the climate-controlled, shippable orchard. An example of the interaction a user would have is depicted here.


As a user puts the tracked controller into the plant’s collider, she can spatially pick one of the options (“pluck”, “about”, “farm”).

‘Pluck’ will do exactly what you’d expect, spawning perhaps a grocery bag for the user to place that bit of shiso (or kale) in. ‘About’ would detail more about the crop (e.g. its origins and health benefits). ‘Farm’ would describe the locales of optimal growth and known farmers of such a crop.

If you have an idea that you think would slot well into the Project Futures virtual reality series about the future of different products, ping me at dilan [dot] shah [at] gmail [dot] com as I would love to talk to you about it.