How diffusion models work and why every AI builder should understand them

Executive overview

Most AI builders know diffusion as the technology behind image generation. It is far more general: a single training procedure that can learn any high-dimensional data distribution — proteins, weather, robotics policies, text — with surprisingly little data.

The core operation is trivial to state: add noise to data, then teach a model to reverse it. The hard-won insight is that predicting a global velocity between noise and data (flow matching) reduces the entire training loop to about five lines of code, is more stable, and outperforms earlier objectives.

The simplest formulation of what may be the most powerful framework in ML today is just this: minimize the loss between the predicted velocity and the actual velocity, where the velocity is noise minus data.
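
Written out, that objective might look as follows. The notation is chosen here for illustration and is not taken from the source: x_0 is a data sample, ε is Gaussian noise, t is a random time, x_t is the point between data and noise at time t, and v_θ is the model. A squared-error loss is assumed.

```latex
\mathcal{L}(\theta) =
  \mathbb{E}_{x_0,\,\epsilon,\,t}
  \left[\, \lVert v_\theta(x_t, t) - (\epsilon - x_0) \rVert^2 \,\right]
```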

What diffusion is and how it works

  • Learns any data distribution P(data) for any domain, given enough data.
  • Particularly strong at mapping high dimensions to high dimensions in low-data regimes: as few as 30 images can be enough to generate new ones.
  • Forward process: repeatedly add noise to a sample until it becomes random static.
  • Reverse process: train a model to undo that noise, step by step.
  • The noise schedule (beta schedule) controls how much noise is added at each step. A linear schedule is unstable; a sigmoid-shaped cumulative schedule keeps the error roughly constant per step.
  • Architecture is fully decoupled: the denoising model can be a U-Net, transformer, RNN, or anything else.
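
The forward process and the two schedule shapes mentioned above can be sketched in a few lines. This is a minimal illustration, not the source's code: the function names, the beta range (1e-4 to 0.02), the sigmoid steepness, and the three-element stand-in sample are all assumptions. It also uses the standard shortcut for Gaussian noise, jumping directly to step t from the cumulative signal fraction instead of adding noise step by step.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that grow linearly (assumed defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_signal(betas):
    """alpha_bar[t]: fraction of the original signal variance left after step t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def sigmoid_cumulative_signal(num_steps, steepness=6.0):
    """A sigmoid-shaped alpha_bar defined directly, without per-step betas."""
    values = []
    for i in range(num_steps):
        u = i / (num_steps - 1)  # position in the schedule, from 0 to 1
        values.append(1.0 / (1.0 + math.exp(steepness * (2.0 * u - 1.0))))
    return values


def add_noise(x0, alpha_bar_t, rng):
    """Jump straight to a noised sample: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    signal, noise = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]  # stand-in for one data sample
linear = cumulative_signal(linear_betas(1000))
sigmoid = sigmoid_cumulative_signal(1000)
noised_midway = add_noise(x0, sigmoid[500], rng)
```

Both schedules end with almost no signal left; they differ in how the loss of signal is distributed across the steps.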

Evolution of training objectives

  • Original 2015 paper (Sohl-Dickstein et al.): predict x(t−1) from x(t), trained with a KL-divergence loss; correct but verbose.
  • Predicting the added noise proved easier for the model to learn.
  • Predicting velocity (noise minus data) was easier still and more stable.
  • Flow matching (Meta, Yaron Lipman): replace the circuitous noising path with a straight-line global velocity between noise and data; the objective collapses to five lines of code.
  • FID scores improved steadily with each shift; code got shorter as the math got cleaner — the opposite of typical ML progress.
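
The targets in this list are closely related. On a straight-line path, a velocity prediction can be turned back into a data estimate or a noise estimate, which suggests the objectives differ mainly in how easy the regression is rather than in what information they carry. A minimal sketch, assuming a convention chosen here in which t = 0 is clean data and t = 1 is pure noise; the function names are illustrative.

```python
def interpolate(data, noise, t):
    """Point on the straight line between data (t = 0) and noise (t = 1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(data, noise)]


def velocity(data, noise):
    """The velocity target as the text defines it: noise minus data."""
    return [n - d for d, n in zip(data, noise)]


def data_from_velocity(x_t, v, t):
    """Walk back along the line to t = 0."""
    return [x - t * vi for x, vi in zip(x_t, v)]


def noise_from_velocity(x_t, v, t):
    """Walk forward along the line to t = 1."""
    return [x + (1.0 - t) * vi for x, vi in zip(x_t, v)]


data, noise, t = [2.0, -1.0], [0.5, 0.25], 0.25
x_t = interpolate(data, noise, t)
v = velocity(data, noise)
recovered_data = data_from_velocity(x_t, v, t)
recovered_noise = noise_from_velocity(x_t, v, t)
```

Given x(t) and t, knowing any one of the three quantities fixes the other two.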

Flow matching in practice

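  • Training: sample an image, sample Gaussian noise, pick a timestep t in [0, 1]; x(t) = (1−t) × data + t × noise, so t = 0 is clean data and t = 1 is pure noise; velocity = noise − data, the rate of change of x(t) along that line; minimize the squared error between the predicted velocity and this velocity.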
  • The target velocity is time-independent: for a given pair of data and noise the path is a straight line, so the target points in the same direction at every point in the schedule.
  • Inference: start from random noise and take Euler steps along the model's predicted velocity until the data end of the path is reached.
  • Key constraint: inference should stay within the schedule the model was trained on; a model trained on a fixed set of timesteps cannot be stepped beyond it without retraining (distillation can compress the number of steps but requires retraining at the new count).
  • The training loop is domain-agnostic — identical code works for images, proteins, weather, trajectories.
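
The training and inference steps above can be sketched end to end on a toy problem. Everything specific here is an assumption made for illustration: the "dataset" is a 1-D Gaussian centred at 3, the model is a three-weight linear function standing in for a neural network, gradients are written by hand instead of using an autodiff framework, and t = 0 is taken to be data and t = 1 noise so that noise minus data is the rate of change of x(t).

```python
import random


def model(w, x, t):
    """Tiny stand-in for the denoiser: predicted velocity is linear in x and t."""
    return w[0] + w[1] * x + w[2] * t


def sample_pair(rng):
    """One training pair: a toy 1-D data point and Gaussian noise."""
    data = rng.gauss(3.0, 0.5)  # toy "dataset": N(3, 0.5^2)
    noise = rng.gauss(0.0, 1.0)
    return data, noise


def mean_loss(w, examples):
    """Mean squared error between predicted and target velocity."""
    total = 0.0
    for data, noise, t in examples:
        x_t = (1.0 - t) * data + t * noise
        total += (model(w, x_t, t) - (noise - data)) ** 2
    return total / len(examples)


def train(rng, steps=20000, lr=0.01):
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        data, noise = sample_pair(rng)
        t = rng.random()
        x_t = (1.0 - t) * data + t * noise         # point on the straight path
        error = model(w, x_t, t) - (noise - data)  # predicted minus target velocity
        # Hand-written gradient step on error**2 for the linear model.
        for i, feature in enumerate((1.0, x_t, t)):
            w[i] -= lr * 2.0 * error * feature
    return w


def euler_sample(velocity_fn, noise, num_steps):
    """Integrate from t = 1 (noise) down to t = 0 (data) with Euler steps."""
    x, dt = noise, 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt
        x -= dt * velocity_fn(x, t)
    return x


rng = random.Random(0)
held_out = [(*sample_pair(rng), rng.random()) for _ in range(2000)]
loss_before = mean_loss([0.0, 0.0, 0.0], held_out)
weights = train(rng)
loss_after = mean_loss(weights, held_out)
samples = [
    euler_sample(lambda x, t: model(weights, x, t), rng.gauss(0.0, 1.0), 50)
    for _ in range(1000)
]
```

Only `sample_pair` knows anything about the domain; the loss, the training loop, and the sampler never look at what the numbers represent. A linear model is far too small for real data, so the generated samples here should be read as a sanity check on the procedure, not as a measure of quality.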

Applications across domains

  • Image and video generation: Stable Diffusion, Midjourney, Sora, Veo, Flux, SD3.
  • Protein structure prediction: AlphaFold 3 (its structure module is diffusion-based, unlike AlphaFold 2), DiffDock (small-molecule binding).
  • Robotics: diffusion policy enables dexterous manipulation; seen as a likely enabler for home robots.
  • Weather forecasting: GenCast, a diffusion-based ensemble model from Google DeepMind, was reported to outperform ECMWF's ENS, the leading operational ensemble forecast.
  • Diffusion LLMs: a major research theme at NeurIPS 2025, both continuous and discrete variants.
  • Current holdouts, where diffusion has not yet beaten the incumbent approach: language modeling (autoregressive LLMs) and game-tree search (AlphaGo-style MCTS).

Diffusion vs. autoregressive LLMs — the squint test

  • LLMs emit one token at a time, never revise, and have three discrete training phases; the architecture is a monolithic stack.
  • Brains are massively recursive, operate on concepts rather than tokens, revise continuously, and use a single learning procedure throughout.
  • Diffusion provides two brain-like properties LLMs lack: structured use of randomness, and the ability to emit a full chunk of thought and revise it iteratively.
  • Flow matching in particular decouples concept generation from token-level decoding.

Advice for founders and researchers

  • Training models: treat diffusion as a default candidate for any ML application, even if only to produce a latent space for downstream training.
  • Not training models: update your prior on the rate of improvement — image generation improved roughly 1000x in five years; the same trajectory is playing out in proteins, robotics, and DNA.
  • Foundational bets: robotic home assistants, protein/DNA/metabolomics models, and diffusion-based code generation are all on trajectories that appear tractable.
