Flow Matching: The Journey After Diffusion

The Itch Diffusion Leaves Behind

Recall the end of the diffusion journey. We built a forward SDE to drown data in noise, learned the score of every noised marginal, derived the reverse SDE, and then discovered the probability flow ODE: a deterministic velocity field whose smooth trajectories carry noise to data with exactly the right distribution. DDIM, the fast sampler everyone actually uses, is an integrator of that ODE.

Now look at the construction with fresh eyes. The thing we use at the end is a velocity field. The things we built to get it: a stochastic process, a score ladder, an ELBO, reverse time theorems. The entire stochastic apparatus was scaffolding around an object that is, in the end, just arrows telling points where to move.

Flow matching is the act of removing the scaffolding. Choose the path you want probability mass to follow from noise to data. Write down the velocity field of that path. Regress a network onto it. Done. No SDE, no score, no ELBO, and, as we will see, with the freedom to choose straight paths, which diffusion never had.

The Long Way and the Shortcut to the Same Object

Top route, the diffusion blog: forward SDE, score matching, reverse SDE, and finally the probability flow ODE. Bottom route, this blog: declare the path, derive its velocity, regress. Both end at a velocity field a network can learn. The shortcut also unlocks something the long way could not: the path is now a design choice, and the best choice turns out to be a straight line.

The Object: Probability as a Fluid

First, meet the object properly. A velocity field v(x, t) assigns an arrow to every point in space at every time. Drop a particle anywhere and it must follow the arrows:

\frac{d}{dt}\,\phi_t(x) = v\big(\phi_t(x),\,t\big),\qquad \phi_0(x) = x

φ_t is the flow map: where a particle that started at x has been carried by time t. An ODE, the same species as diffusion's probability flow.

Now drop not one particle but an entire distribution of them, p₀. The flow carries the whole cloud, and the density at time t is the pushforward p_t = [φ_t]_* p₀. The deep law governing this is borrowed straight from fluid dynamics, the continuity equation:

\frac{\partial p_t(x)}{\partial t} \;+\; \nabla \cdot \big(p_t(x)\, v(x,t)\big) \;=\; 0

Probability mass is a conserved bookkeeping fluid: it is never created or destroyed, only transported by v. It is not incompressible, though: the cloud may spread or contract wherever ∇·v ≠ 0. The change of density at a point equals the net inflow there. This single PDE is the contract between a velocity field and the densities it generates.
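To make the contract concrete, here is a small numerical check, offered as a sketch rather than a proof. It assumes a hand-picked one dimensional pair that is not from the text above: the scaling flow φ_t(x) = (1 + t)x, whose velocity is v(x, t) = x/(1 + t) and which carries N(0, 1) to N(0, (1 + t)²). The grid, the time t = 0.3, and the function names are illustrative choices.

```python
import numpy as np

def density(x, t):
    """p_t for the scaling flow phi_t(x) = (1 + t) x applied to N(0, 1)."""
    s = 1.0 + t
    return np.exp(-0.5 * (x / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def velocity(x, t):
    """v(x, t) = x / (1 + t), the field whose flow map is phi_t(x) = (1 + t) x."""
    return x / (1.0 + t)

x = np.linspace(-6.0, 6.0, 2001)
t, h = 0.3, 1e-4

# Central difference in time for dp/dt.
dp_dt = (density(x, t + h) - density(x, t - h)) / (2.0 * h)
# Spatial derivative of the flux p * v on the grid.
div_flux = np.gradient(density(x, t) * velocity(x, t), x)

# The continuity equation says these two terms cancel.
residual = np.max(np.abs(dp_dt + div_flux))
# Total mass on the grid (simple Riemann sum) should stay close to 1.
mass = np.sum(density(x, t)) * (x[1] - x[0])
print(f"max residual = {residual:.2e}, mass = {mass:.6f}")
```

The residual is limited only by the finite difference grid, and the mass stays at one even though this particular cloud is spreading.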

So the generative recipe is: find a field v whose flow carries an easy p₀ (Gaussian noise) to the data distribution p₁ at t = 1, then sample by integrating the ODE. This idea is older than flow matching: continuous normalizing flows (2018) trained exactly this object by maximum likelihood. Their curse was the training cost: evaluating the likelihood requires simulating the entire ODE and its divergence for every training example. Beautiful theory, brutal compute. Flow matching exists to keep the object and delete the simulation.

A Cloud Carried by Arrows

Three snapshots of one flow. The same particles, carried by the velocity field from a Gaussian blob at t = 0, through a stretched intermediate at t = 0.5, into a two island data distribution at t = 1. The arrows are v(x, t) at that moment. The continuity equation is what guarantees the moving cloud is a valid probability density at every instant.

The Dream Objective, and Why It Is Impossible

Here is the loss we would love to optimize. Pick any probability path p_t you like, connecting noise p₀ to data p₁, with u_t(x) the true velocity field that generates it. Then just regress:

\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t,\;x \sim p_t}\Big[\big\|\,v_\theta(x, t) - u_t(x)\,\big\|^2\Big]

Flow Matching, the dream version: match the network to the true field, everywhere, at every time.

One problem: we cannot evaluate u_t(x). The marginal velocity at a point depends on the entire data distribution at once, the same way diffusion's marginal score did. We know where individual samples are, not what the collective flow of the whole distribution looks like at an arbitrary point in space. The dream loss has an unknowable regression target.

If this feels familiar, it should. Diffusion hit the identical wall: the marginal score was intractable, and the rescue was Vincent's theorem, denoising score matching, which replaced the unknowable marginal with a computable per sample conditional. Flow matching is about to make exactly the same move, and the parallel is not an analogy, it is the same mathematics.

The Miracle: Conditional Flow Matching

The rescue (Lipman et al., 2023). Stop trying to describe the whole river at once. Instead, describe the journey of one data point at a time. Condition on a single data sample x₁ and choose a simple conditional path p_t(x | x₁), a little moving cloud that starts inside the noise at t = 0 and lands on x₁ at t = 1. For a simple cloud, its conditional velocity u_t(x | x₁) is known in closed form. Then train on the conditional target:

\mathcal{L}_{CFM}(\theta) = \mathbb{E}_{t,\;x_1 \sim q,\;x \sim p_t(\cdot|x_1)}\Big[\big\|\,v_\theta(x, t) - u_t(x \mid x_1)\,\big\|^2\Big]

Every term is now samplable and computable. No simulation, no score, no ELBO.

And now the theorem that makes the whole field work. The marginal velocity is exactly the conditional expectation of the per sample velocities passing through a point:

u_t(x) = \mathbb{E}\big[\,u_t(x \mid x_1)\;\big|\; x_t = x\,\big] = \int u_t(x|x_1)\,\frac{p_t(x|x_1)\,q(x_1)}{p_t(x)}\,dx_1

At any point in space, the river's true flow is the weighted average of every individual journey currently passing through that point.

And least squares regression has a famous property: the minimizer of E‖v_θ(x) − Y‖² over functions of x is the conditional mean E[Y | x]. Expand both losses and the cross terms match through exactly this identity, giving:

\nabla_\theta\, \mathcal{L}_{CFM}(\theta) \;=\; \nabla_\theta\, \mathcal{L}_{FM}(\theta)

The two objectives differ by a constant independent of θ. Training on cheap per sample targets IS training on the intractable marginal. This is the flow matching twin of Vincent 2011.
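For readers who want the expansion written out, here is a sketch of it, assuming enough regularity to swap the order of integration. Expand both squared norms:

```latex
\begin{aligned}
\mathcal{L}_{FM}(\theta) &= \mathbb{E}_{t,\,x\sim p_t}\big[\|v_\theta(x,t)\|^2\big]
  - 2\,\mathbb{E}_{t,\,x\sim p_t}\big[\langle v_\theta(x,t),\,u_t(x)\rangle\big]
  + \mathbb{E}_{t,\,x\sim p_t}\big[\|u_t(x)\|^2\big] \\
\mathcal{L}_{CFM}(\theta) &= \mathbb{E}_{t,\,x_1,\,x}\big[\|v_\theta(x,t)\|^2\big]
  - 2\,\mathbb{E}_{t,\,x_1,\,x}\big[\langle v_\theta(x,t),\,u_t(x\mid x_1)\rangle\big]
  + \mathbb{E}_{t,\,x_1,\,x}\big[\|u_t(x\mid x_1)\|^2\big] \\
\mathbb{E}_{x_1\sim q,\;x\sim p_t(\cdot|x_1)}\big[\langle v_\theta(x,t),\,u_t(x\mid x_1)\rangle\big]
 &= \int \Big\langle v_\theta(x,t),\; \int u_t(x\mid x_1)\,\frac{p_t(x\mid x_1)\,q(x_1)}{p_t(x)}\,dx_1 \Big\rangle\, p_t(x)\,dx
 = \mathbb{E}_{x\sim p_t}\big[\langle v_\theta(x,t),\,u_t(x)\rangle\big]
\end{aligned}
```

The first terms agree because integrating p_t(x | x₁) q(x₁) over x₁ gives p_t(x). The cross terms agree by the third line, which is the marginal velocity identity. The last terms do not involve θ, so they are the constant by which the two losses differ.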

The Marginal Field Is an Average of Journeys

Several individual journeys, each from a noise draw to its own data point, pass through the neighborhood of one query point x. Each contributes its own conditional arrow. The true marginal velocity at x, the bold arrow, is their weighted average. The network never sees the average during training. It sees one random arrow at a time, and least squares regression converges to the average automatically.
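The averaging claim can be checked by brute force. The sketch below assumes a toy setup that is not from the text: one dimensional data that is +1 or −1 with equal probability, the straight conditional path x_t = (1 − t)x₀ + t x₁ with conditional arrow x₁ − x₀, and a query point x = 0.2 at t = 0.5. For this toy data the marginal velocity has a closed form, because E[x₁ | x_t = x] = tanh(x t/(1 − t)²).

```python
import numpy as np

rng = np.random.default_rng(0)
t, x_query, half_width = 0.5, 0.2, 0.03
n = 1_000_000

# Independent coupling: Gaussian noise paired with a random data point in {-1, +1}.
x0 = rng.standard_normal(n)
x1 = rng.choice(np.array([-1.0, 1.0]), size=n)
xt = (1.0 - t) * x0 + t * x1
target = x1 - x0  # the conditional arrow of each individual journey

# Average the arrows of all journeys that pass close to the query point at time t.
near = np.abs(xt - x_query) < half_width
mc_average = target[near].mean()

# Closed form marginal velocity for this toy data.
posterior_mean = np.tanh(x_query * t / (1.0 - t) ** 2)   # E[x1 | x_t = x]
marginal = (posterior_mean - x_query) / (1.0 - t)

print(f"Monte Carlo average of arrows: {mc_average:.3f}")
print(f"closed form marginal velocity: {marginal:.3f}")
```

The two numbers agree up to Monte Carlo noise and the width of the bin around the query point.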

Choose the Simplest Path: a Straight Line

The theorem holds for any conditional path. So choose the simplest one imaginable. Connect a noise draw x₀ to a data point x₁ with linear interpolation (note the convention flip from the diffusion blog: here t = 0 is noise and t = 1 is data):

x_t = (1-t)\,x_0 + t\,x_1 \qquad\Longrightarrow\qquad u_t(x_t \mid x_0, x_1) = \frac{d x_t}{dt} = x_1 - x_0

The conditional velocity is the displacement vector, constant along the entire segment. No schedules, no ᾱ, no square roots.

This is the rectified flow / optimal transport path, and it turns training into something almost embarrassingly simple:

\begin{aligned}&\textbf{repeat: } x_1 \sim \text{data},\;\; x_0 \sim \mathcal{N}(0, I),\;\; t \sim \mathcal{U}[0,1]\\&\quad x_t = (1-t)\,x_0 + t\,x_1\\&\quad \text{gradient step on } \big\|\,v_\theta(x_t, t) - (x_1 - x_0)\,\big\|^2\end{aligned}

Compare with DDPM's training box from the diffusion blog. Same skeleton, but the blend is a lerp and the target is a subtraction.
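As a minimal runnable sketch of that training box, the loop below swaps the neural network for a hypothetical four feature linear model so that it fits in plain NumPy. The toy data distribution N(1, 0.5²), the feature set, the learning rate, and the batch size are all assumptions made for illustration, not choices from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, t):
    """Hypothetical stand-in for a network: v_theta(x, t) = features(x, t) @ theta."""
    return np.stack([np.ones_like(x), x, t, x * t], axis=1)

def sample_batch(n):
    x1 = 1.0 + 0.5 * rng.standard_normal(n)   # toy "data" (an assumption)
    x0 = rng.standard_normal(n)               # noise
    t = rng.uniform(0.0, 1.0, n)
    xt = (1.0 - t) * x0 + t * x1              # the lerp
    return xt, t, x1 - x0                     # input, time, regression target

def cfm_loss(theta, batch):
    xt, t, target = batch
    return np.mean((features(xt, t) @ theta - target) ** 2)

theta = np.zeros(4)
eval_batch = sample_batch(20_000)             # fixed batch used only to monitor the loss
loss_before = cfm_loss(theta, eval_batch)

for step in range(2_000):
    xt, t, target = sample_batch(256)
    phi = features(xt, t)
    residual = phi @ theta - target
    grad = 2.0 * phi.T @ residual / len(target)   # gradient of the mean squared error
    theta -= 0.05 * grad                          # plain SGD step

loss_after = cfm_loss(theta, eval_batch)
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
```

The loss cannot reach zero, even with a perfect model: the target x₁ − x₀ is random given (x_t, t), and the best any field can do is predict its conditional mean.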

The running worked example returns. The same pixel from the diffusion blog, data value x₁ = 1.0. Draw noise x₀ = −0.4, draw t = 0.5. The training input is x_t = 0.5 × (−0.4) + 0.5 × 1.0 = 0.3. The regression target is x₁ − x₀ = 1.0 − (−0.4) = 1.4, and it would be 1.4 at every t along this segment. If the network outputs 1.3, the loss is (0.1)² = 0.01. That is the entire training step.
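The same arithmetic as a few lines of Python, using only the numbers from the worked example (the variable names are illustrative):

```python
x1, x0, t = 1.0, -0.4, 0.5        # data value, noise draw, sampled time
x_t = (1 - t) * x0 + t * x1       # training input on the segment
target = x1 - x0                  # regression target, the same for every t
prediction = 1.3                  # the network output assumed in the text
loss = (prediction - target) ** 2
print(x_t, target, loss)          # values match 0.3, 1.4, 0.01 up to float rounding
```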

Sampling is numerical ODE integration, Euler in the simplest case:

x \;\leftarrow\; x + \Delta t \cdot v_\theta(x, t),\qquad \text{from } x \sim \mathcal{N}(0,I) \text{ at } t=0 \text{ to } t=1

10 to 50 steps in practice, or any off the shelf higher order solver. There is no noise to inject because there is no SDE.
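Here is a sketch of that Euler loop. Since no trained network is available in a blog post, it assumes a stand-in field: the closed form marginal velocity of the straight path when the data is +1 or −1 with equal probability (a toy two island distribution chosen for illustration). In practice `velocity` would be the trained v_θ.

```python
import numpy as np

def marginal_velocity(x, t):
    """Closed form u_t(x) for the straight path when data is +1 or -1 with equal probability."""
    posterior_mean = np.tanh(x * t / (1.0 - t) ** 2)   # E[x1 | x_t = x]
    return (posterior_mean - x) / (1.0 - t)

def euler_sample(velocity, x, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                 # left endpoint of each step, so t = 1 is never evaluated
        x = x + dt * velocity(x, t)
    return x

rng = np.random.default_rng(0)
x_noise = rng.standard_normal(5_000)
x_data = euler_sample(marginal_velocity, x_noise, n_steps=50)

print("fraction landing near +1:", np.mean(np.abs(x_data - 1.0) < 0.05))
print("fraction landing near -1:", np.mean(np.abs(x_data + 1.0) < 0.05))
```

Any higher order solver can replace `euler_sample`; the only contract is a function of (x, t) that returns a velocity.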

The Whole Recipe in One Picture, Numbers Included

Noise draws on the left, data points on the right, each training pair joined by a straight segment with one constant velocity arrow along it. The highlighted segment is the worked example: from −0.4 to 1.0, midpoint 0.3, target velocity 1.4 everywhere on the segment. Training picks random points on random segments and regresses on the segment's arrow.

BUT WAIT: Different training segments cross each other. At a crossing point the network is told two different velocities for the same (x, t). How can one field satisfy both?

It cannot, and it is not supposed to. This is the most instructive confusion in flow matching, and the answer was already planted in §4.

When two segments pass through the same point at the same time, the network receives contradictory targets there, sometimes the arrow of segment A, sometimes the arrow of segment B. Least squares regression under contradictory targets does not pick a winner. It converges to the conditional mean of the targets, the average arrow. And by the §4 identity, that average is precisely the true marginal velocity u_t(x). The contradiction is not a bug being tolerated. The averaging IS the mechanism that assembles the intractable marginal field out of cheap conditional pieces.

The price is geometric: where conditional segments cross, the averaged marginal flow must bend to stay a valid single valued field. So even though every training path was straight, the learned flow's trajectories are curved near crossings. Deterministic ODE trajectories cannot pass through the same point at the same time (uniqueness of solutions), so the marginal flow weaves smoothly where the training segments collided.

This single observation explains the next section: fewer crossings mean straighter marginal flows, and straighter flows mean fewer solver steps. The whole rectification program is a war on crossings.

Crossings, Curvature, and Rectification

So the learned flow is straight exactly where training segments did not fight. With the default independent coupling (any noise paired with any data point), fights are everywhere: segments from all over the noise blob to all over the data manifold crisscross constantly, and the marginal trajectories come out gently curved. Curved trajectories need more Euler steps. The straightness we chose for the conditionals did not fully survive the averaging.

Rectified flow (Liu et al., 2022) fixes the coupling instead of the path. Train a first flow. Then generate pairs with the model itself: integrate noise x₀ forward to its output x̂₁, and keep (x₀, x̂₁) as a new training pair. These pairs are produced by non crossing ODE trajectories, so retraining on them, the reflow step, yields a dramatically straighter field. Each reflow provably does not increase transport cost and straightens trajectories, until one or two Euler steps suffice. This is the lineage behind one step image generators.
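A sketch of the pair generation step, under loudly stated assumptions: the "trained first flow" is replaced by the exact marginal velocity for transporting N(0, 1) to a toy data distribution N(1, 0.5²) with independent coupling, which is available in closed form. The names `first_flow` and `integrate` are illustrative, not an API from any library.

```python
import numpy as np

MU, S = 1.0, 0.5   # toy data distribution N(MU, S^2), an assumption for illustration

def first_flow(x, t):
    """Stand-in for a trained first flow: exact marginal velocity for N(0, 1) -> N(MU, S^2)."""
    var_t = (1.0 - t) ** 2 + (t * S) ** 2     # Var(x_t) under independent coupling
    cov = t * S ** 2 - (1.0 - t)              # Cov(x1 - x0, x_t)
    return MU + cov / var_t * (x - t * MU)    # E[x1 - x0 | x_t = x]

def integrate(velocity, x, n_steps=100):
    """Euler integration from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)
    return x

rng = np.random.default_rng(0)
n = 5_000
x0 = rng.standard_normal(n)
x1_independent = MU + S * rng.standard_normal(n)   # any noise paired with any data point
x1_reflow = integrate(first_flow, x0)              # each noise draw paired with its own output

# Transport cost of each coupling: mean squared length of the training segments.
cost_independent = np.mean((x1_independent - x0) ** 2)
cost_reflow = np.mean((x1_reflow - x0) ** 2)
print(f"independent pairs: {cost_independent:.3f}, reflow pairs: {cost_reflow:.3f}")
```

Retraining on `(x0, x1_reflow)` with the same lerp and subtraction is the reflow step. In this one dimensional toy the reflow pairing is order preserving, so its segments never cross.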

Why Marginals Bend, and How Reflow Unbends Them

Left: independent coupling. Two straight conditional segments cross; at the crossing the field must average them, so the actual marginal trajectories (solid) bend around each other. Right: after reflow, the pairing is rewired by the model's own non crossing trajectories, segments no longer fight, and the marginal flow is nearly straight: one big Euler stride lands on target.

The Family Reunion: Diffusion Is a Flow Matching Choice

Now the moment the two blogs merge. Flow matching let us choose the conditional path. What if, instead of a straight line, we choose a Gaussian path with a schedule, the little cloud shrinking onto the data point:

p_t(x \mid x_1) = \mathcal{N}\big(\alpha_t\, x_1,\; \sigma_t^2 I\big) \qquad\Longrightarrow\qquad u_t(x \mid x_1) = \frac{\dot\sigma_t}{\sigma_t}\big(x - \alpha_t x_1\big) + \dot\alpha_t\, x_1

The general Gaussian conditional velocity. Every diffusion schedule is one choice of (α_t, σ_t).

Plug in the variance preserving schedule from the diffusion blog (α_t = √ᾱ, σ_t = √(1−ᾱ), reparameterized to continuous time and run in this blog's direction, so that t = 0 is noise and t = 1 is data) and the marginal field you obtain is exactly the probability flow ODE of §8 of the diffusion blog. DDPM, score matching, DDIM: all of it is flow matching with one particular curved Gaussian path. The straight OT path is simply a different, and better behaved, point in the same design space. Sanity check with the straight path as a special case: α_t = t, σ_t = 1 − t gives u_t(x|x₁) = (x₁ − x)/(1 − t), and substituting x = (1−t)x₀ + t x₁ collapses it to x₁ − x₀, our constant arrow.
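The Gaussian conditional velocity is easy to check numerically. The sketch below evaluates it on the straight chord, reusing the worked example's numbers, and on one curve lying on the arc α² + σ² = 1. The trigonometric schedule α_t = sin(πt/2), σ_t = cos(πt/2) is an assumption picked for convenience, not the diffusion blog's schedule.

```python
import numpy as np

def conditional_velocity(x, x1, alpha, sigma, d_alpha, d_sigma):
    """u_t(x | x1) = (sigma'/sigma) (x - alpha x1) + alpha' x1 for the path N(alpha x1, sigma^2 I)."""
    return d_sigma / sigma * (x - alpha * x1) + d_alpha * x1

x0, x1, t = -0.4, 1.0, 0.5

# Straight chord: alpha = t, sigma = 1 - t.
x_t = (1 - t) * x0 + t * x1
u_chord = conditional_velocity(x_t, x1, alpha=t, sigma=1 - t, d_alpha=1.0, d_sigma=-1.0)

# A curve on the arc alpha^2 + sigma^2 = 1 (trigonometric schedule, assumed here).
def arc_point(s):
    return np.sin(0.5 * np.pi * s) * x1 + np.cos(0.5 * np.pi * s) * x0

alpha, sigma = np.sin(0.5 * np.pi * t), np.cos(0.5 * np.pi * t)
d_alpha, d_sigma = 0.5 * np.pi * sigma, -0.5 * np.pi * alpha
u_arc = conditional_velocity(arc_point(t), x1, alpha, sigma, d_alpha, d_sigma)

# Finite difference of the sample path itself, for comparison with the formula.
h = 1e-6
u_arc_fd = (arc_point(t + h) - arc_point(t - h)) / (2 * h)

print("chord velocity:", u_chord, "displacement:", x1 - x0)
print("arc velocity:", u_arc, "finite difference:", u_arc_fd)
```

On the chord the formula returns the displacement; on the arc it matches the time derivative of the sample path, which changes with t.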

The dictionary between the languages is linear, extending the diffusion blog's table: for Gaussian paths, score, noise prediction, clean image prediction, and velocity are interconvertible. Substituting E[x₁ | x_t = x] = (x + σ² s(x, t))/α into the Gaussian conditional velocity gives v(x, t) = (α̇/α) x + σ²(α̇/α − σ̇/σ)·s(x, t), and the ε, x̂₀, v parameterizations are affine re-labelings of one another. One object, now five lenses.

One Design Space: Diffusion's Arc and Flow Matching's Chord

The plane of (signal scale α_t, noise scale σ_t). Every generative path is a curve from the noise corner (α=0, σ=1) to the data corner (α=1, σ=0). Diffusion's variance preserving schedule walks the curved arc α² + σ² = 1. Flow matching's OT path takes the straight chord α + σ = 1. Same corners, same theorems, different geometry, and the chord is the one you can integrate in a few strides.

BUT WAIT: If Gaussian path flow matching reproduces diffusion exactly, is flow matching just diffusion with a different schedule, or is it genuinely more general?

On the overlap, they are the same theory, and that is a feature: every diffusion result transfers. But the flow matching formulation is strictly larger in three real ways.

The source can be anything. Diffusion's forward process must end at a Gaussian, by construction of the noising chain. Flow matching only needs samples from p₀ and p₁. Transport one image distribution to another, sketches to photos, low resolution to high, molecules to molecules. Noise is just the most common choice of p₀, not a requirement of the theory.

The path can be anything satisfying the boundary conditions. Gaussian paths are one family. Straight lines, optimal transport couplings, paths confined to manifolds or constraint sets, and the general stochastic interpolant x_t = α_t x₁ + β_t x₀ + γ_t z (Albergo and Vanden-Eijnden) all fit the same CFM theorem. Diffusion occupies one slice of this space.

The coupling can be learned. Diffusion pairs every data point with independent noise, always. Flow matching can rewire who is paired with whom, minibatch OT coupling, reflow pairs, leading to straighter fields and fewer steps. That degree of freedom simply does not exist in the SDE story.

The honest summary: flow matching did not refute diffusion. It found the coordinate system in which diffusion is one curve among many, and then picked a better curve.

The Deeper Theory: Optimal Transport and the Cost of a Flow

One more level down, for the expert shelf. Among all velocity fields that transport p₀ to p₁, which is the best one? Fluid dynamics has owned this question since Benamou and Brenier (2000): rank fields by their kinetic energy, the total effort of moving the probability fluid:

\mathcal{K}(v) = \int_0^1 \!\! \int \big\|v(x,t)\big\|^2\, p_t(x)\; dx\; dt \qquad \text{minimized} \;\Longleftrightarrow\; \text{dynamic optimal transport}

The minimizer is the optimal transport flow: every particle travels in a straight line at constant speed, no wasted motion, and the total cost equals the squared Wasserstein distance W₂(p₀, p₁)².
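A worked instance where everything is available in closed form, offered as an illustration: one dimensional Gaussians, with the numbers N(0, 1) and N(1, 0.5²) chosen here as a toy case.

```latex
\begin{aligned}
p_0 &= \mathcal{N}(m_0, s_0^2),\qquad p_1 = \mathcal{N}(m_1, s_1^2) \\
T(x) &= m_1 + \frac{s_1}{s_0}\,(x - m_0) \qquad \text{(optimal map, monotone)} \\
\phi_t(x) &= (1-t)\,x + t\,T(x),\qquad v\big(\phi_t(x), t\big) = T(x) - x \qquad \text{(straight line, constant speed)} \\
\mathcal{K}(v) &= \int_0^1\!\!\int \big|T(x) - x\big|^2\, p_0(x)\,dx\,dt = (m_1 - m_0)^2 + (s_1 - s_0)^2 = W_2(p_0, p_1)^2 \\
\text{toy case: } & m_0 = 0,\; s_0 = 1,\; m_1 = 1,\; s_1 = 0.5 \;\Longrightarrow\; W_2^2 = 1 + 0.25 = 1.25
\end{aligned}
```

For comparison, pairing the same two Gaussians independently gives a mean squared segment length of (m₁ − m₀)² + s₀² + s₁² = 2.25, nearly twice the optimum.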

This is the theoretical north star behind everything in §5 and §6. The OT flow is the straightest possible bridge. Per sample straight segments are the right local imitation of it. The reason independent coupling falls short is now precise: straightness of each segment is necessary but not sufficient, the pairing also has to be transport efficient, otherwise segments fly across the whole space and collide. Hence the two practical upgrades:

Minibatch OT coupling. Inside each training batch, solve a small optimal transport matching between the batch of noise draws and the batch of data points, then connect matched pairs. Segments become short and nearly parallel, crossings plummet, the learned field straightens, all for the cost of a tiny assignment problem per batch.

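Reflow as cost descent. Each rectification round provably does not increase the transport cost of the coupling and straightens trajectories. Reflow is a fixed point iteration crawling toward straight, low cost couplings, with the one step generator as the prize at the bottom. In one dimension the fixed point is the OT map; in higher dimensions it is a straight coupling that is not guaranteed to be exactly optimal.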
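A sketch of the minibatch matching idea in the one setting where it needs no solver: in one dimension with squared distance cost, the optimal matching simply pairs sorted noise with sorted data. The batch size and toy data distribution are assumptions; in higher dimensions the sort would be replaced by an assignment or OT solver, which is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 64
x0 = rng.standard_normal(batch)                  # noise batch
x1 = 1.0 + 0.5 * rng.standard_normal(batch)      # toy data batch (an assumption)

# 1D optimal matching: the k-th smallest noise draw gets the k-th smallest data point.
order0, order1 = np.argsort(x0), np.argsort(x1)
x1_matched = np.empty_like(x1)
x1_matched[order0] = x1[order1]

def mean_sq_length(a, b):
    """Average squared length of the segments a_i -> b_i."""
    return np.mean((b - a) ** 2)

def count_crossings(a, b):
    """Count pairs of 1D segments (a_i -> b_i) whose endpoints swap order."""
    da = a[:, None] - a[None, :]
    db = b[:, None] - b[None, :]
    return int(np.sum(da * db < 0) // 2)   # each crossing pair is counted twice

print("independent:", mean_sq_length(x0, x1), count_crossings(x0, x1))
print("matched:    ", mean_sq_length(x0, x1_matched), count_crossings(x0, x1_matched))
```

Training then proceeds exactly as before, with `(x0, x1_matched)` in place of the independent pairs.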

Pairing Is the Hidden Half of Straightness

The same noise batch and data batch, paired two ways. Left, independent coupling: long segments thrown across the whole space, a thicket of crossings, high kinetic energy. Right, minibatch OT coupling: each noise point matched to a nearby data point, short parallel segments, few crossings, kinetic energy near the W₂ optimum. The conditional paths are straight in both pictures. Only the right one lets the marginal flow stay straight too.

Flow Matching Today, and the Whole Journey

The industry verdict came fast. Stable Diffusion 3 and Flux are rectified flow transformers, trained with the lerp and the subtraction of §5, with one practical refinement: t is sampled from a logit normal distribution concentrating effort on middle times, where the velocity is hardest. Meta's Movie Gen and the Voicebox and Audiobox line are flow matching, as are leading molecule and protein generators on manifolds. The reasons are the ones this blog earned: a five line training loop, no schedule zoo, stable optimization, few step sampling out of the box, and classifier free guidance carrying over verbatim (guide the velocity exactly as you guided the score).
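Both practical details fit in a few lines. The sketch below assumes a logit normal with location 0 and scale 1, and the common guidance form that extrapolates from the unconditional prediction toward the conditional one; the exact parameters and weight conventions used by any particular model are not specified here, and the function names are illustrative.

```python
import numpy as np

def sample_t_logit_normal(rng, n, loc=0.0, scale=1.0):
    """Draw t in (0, 1) as the logistic sigmoid of a normal sample."""
    z = rng.normal(loc, scale, size=n)
    return 1.0 / (1.0 + np.exp(-z))

def guided_velocity(v_cond, v_uncond, weight):
    """Classifier free guidance applied to velocities instead of scores."""
    return v_uncond + weight * (v_cond - v_uncond)

rng = np.random.default_rng(0)
t = sample_t_logit_normal(rng, 100_000)
middle = np.mean((t > 0.25) & (t < 0.75))
print(f"fraction of t in (0.25, 0.75): {middle:.3f} (a uniform t would give 0.5)")

# weight = 1 recovers the conditional velocity, larger weights push past it.
print(guided_velocity(v_cond=1.4, v_uncond=1.0, weight=2.0))
```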

Aspect	Diffusion (DDPM lineage)	Flow matching (RF lineage)
Training target	noise ε in a corrupted sample	displacement x₁ − x₀ on a segment
Blend	√ᾱ x₀ + √(1−ᾱ) ε (x₀ is data there), schedule required	(1−t) x₀ + t x₁ (x₀ is noise here), a lerp
Theory engine	ELBO, denoising score matching, reverse SDE	continuity equation, CFM theorem, OT
Sampling	SDE or probability flow ODE, curved	ODE, near straight, 10 to 50 steps, 1 or 2 after reflow
Source distribution	must be Gaussian	anything you can sample
Coupling	independent, fixed	free: independent, minibatch OT, reflow
Relationship	diffusion = flow matching on the Gaussian arc; the chord is the upgrade

The journey, end to end

Stand back and the two blogs tell one story. The VAE built a bridge from noise to data in a single learned leap and paid in blur. Diffusion stretched the leap into a thousand stochastic whispers, discovered the score hiding in its noise predictor, took the continuum limit, and found, at the very bottom, a deterministic river: the probability flow ODE. Flow matching then asked the question the whole construction was begging for: if the river is the product, learn the river. Declare a path, average cheap per sample arrows into the true field by nothing more than least squares, choose the path straight, fix the pairing, and the thousand whispers become a handful of strides.

The one paragraph summary

Flow matching trains a velocity field by regression: interpolate a noise draw toward a data point, ask the network for the displacement vector, repeat. The conditional flow matching theorem guarantees this cheap per sample regression has the same gradients as matching the intractable marginal field, because least squares averages crossing targets into exactly the marginal velocity. Diffusion is the special case of a curved Gaussian path; the straight path plus transport aware pairing yields straighter flows, few step samplers, and, through reflow, one step generation. Probability is a fluid, the model is its river bed, and training is nothing but pointing arrows from noise to data.

Flow Matching: Learn the River, Skip the Storm