Table of Contents
· Introduction
· Generative Modelling Concept
· Denoising Diffusion Probabilistic Models (DDPM)
∘ Forward Diffusion Process
∘ SDE Formulation
∘ Reverse Diffusion Process
∘ Diffusion Models (DDPM): Training
∘ Diffusion Models (DDPM): Sampling
· Score-Based Diffusion Models
∘ Score Function in the Diffusion
∘ Score-Based Diffusion Models: Training
∘ Score-Based Diffusion Models: Sampling
· Flow Matching
∘ Flow-Matching Models: Training
∘ Flow-Matching Models: Sampling
· Why do untrained diffusion and flow matching models fail so differently?
· Energy-Based Models with Contrastive Divergence
∘ Langevin Dynamics for Sampling
∘ What is Markov Chain Monte Carlo (MCMC)?
∘ Sampling in Energy-based Models (EBMs)
∘ Energy-based Models: Training with Contrastive Divergence
∘ Energy-Based Models: Sampling
· Repository
Introduction
When most people talk about Generative AI today, they are usually thinking about GPTs or text-based prompts. But behind the scenes, a second revolution is happening in how AI generates images and videos, one that feels less like statistical prediction and much more like physics. These models harness core concepts from high-school physics to create pictures from scratch.
If you have ever seen an AI create a photo from thin air, you have likely heard of Diffusion Models. The concept is borrowed straight from the science lab. Imagine a drop of ink dissolving in a glass of water — the ink gradually dilutes over time. In nature, particles move from high concentration to low concentration until everything becomes a uniform gray mixture. Diffusion models learn to hit the rewind button on that process, taking pure noise and systematically concentrating it back into a sharp and clear image.
While diffusion was a massive breakthrough, it comes with computational costs. Generating a single image requires many sequential denoising steps, making the process slow. This has pushed research in another direction called Flow Matching.

Flow Matching tackles the speed problem directly. It learns to move data along straight-line trajectories that are the shortest path from noise to data, instead of gradually adding and removing noise along a curved and complicated path. This makes the vector field simpler to learn and allows generation in far fewer steps. Crucially, these trajectories are defined analytically during training, so the model always has a clean and well-conditioned target to learn from.

Beyond speed, diffusion models are also constrained by their fixed noising procedure: they prescribe how to move from noise to data, which leaves little room for flexibility.
This is where Energy-Based Models offer a different perspective. Rather than learning a generation path, they learn the shape of the data distribution itself. Real data sits at the bottom of an energy landscape while generation emerges rolling downhill into a stable valley, which is more flexible and general in principle even though slower in practice.
In this article, we explore the following four physics-inspired approaches, with mathematical intuition and practical implementation details:
- Diffusion Models or Denoising Diffusion Probabilistic Models (DDPM): Reverse the natural process of diffusion, learning to reconstruct clean data from pure noise through iterative denoising.
- Score-Based Diffusion: Uses the score function (gradient of log probability) to directly guide the reverse diffusion process, offering mathematical elegance and computational efficiency.
- Flow Matching: Finds optimal paths between noise and data distributions, enabling faster generation through learned velocity fields.
- Energy-Based Models with Contrastive Divergence: Learn an energy landscape where probable data points lie in low-energy valleys, with sampling performed via Markov Chain Monte Carlo (MCMC).
By the end of this article, you will understand the theoretical foundations, mathematical formulations, and practical implementations of each method, enabling you to work effectively with modern generative AI systems.
Generative Modelling Concept
After studying these four models, I came to the conclusion that while each model uses a different technique to calculate its loss and update its weights, they all share the same underlying goal: transforming a source distribution — a Gaussian distribution N(0, I) — into a target distribution that approximates the real training data (pdata).

This means we need a generative model capable of producing realistic data (image, video, protein, etc.), even though the true underlying data distribution of the real world is unknown. For example, if we want to generate images of houses, we cannot collect every possible house image in existence, but we can gather a representative sample as training data and ask the model to learn the distribution from it. In doing so, the model learns to map random noise to coherent and clean outputs, effectively starting from pure noise and arriving at the kind of image we want to generate.
Denoising Diffusion Probabilistic Models (DDPM)

Diffusion models exploit the reversibility of stochastic processes. The idea derives from observing how a concentrated solution gradually spreads into a more dilute one over time. These models apply a forward diffusion process (adding noise to the image) and then learn to reverse it, starting from the complete mixture and gradually reconstructing the original state (a clear image).
When this principle is applied to image generation, it closely resembles the traditional process of developing photographic film.
In analog photography, film is submerged in a chemical developer, and over controlled exposure, the faint hidden image slowly materializes into a clear picture.
Diffusion models follow similar logic. They begin with a noisy undeveloped signal and iteratively refine it, reversing the noise-adding process until a coherent image appears.

The core idea lies in learning to reverse a natural diffusion process by taking pure noise and transforming it into coherent data through two stages: the forward diffusion process gradually corrupts data by adding Gaussian noise, and the reverse diffusion process learns to undo this corruption by iteratively denoising toward real data.
Forward Diffusion Process
The forward process is straightforward. The idea is to progressively corrupt clean data x₀ — sampled from the real data distribution pdata — by adding noise at every timestep from t = 0 to t = T, producing a sequence x₁, x₂, …, xT. This continues until, by timestep T, the data has been fully corrupted such that xT is approximately distributed as the Gaussian N(0, I).
The equation describing this process is:

q(xt | xt−1) = N(xt; √(1 − βt)·xt−1, βt·I)

where βt ∈ (0, 1) is the noise (variance) schedule at timestep t and I is the identity matrix.

From the equation above, the image at timestep t (xt) is obtained by scaling the image at the previous timestep (xt−1) and adding Gaussian noise with variance βt.
SDE Formulation
Stochastic Differential Equations (SDEs) play a central role in the diffusion process, as they govern how noise is added to corrupt data in continuous time — where time is treated as continuous rather than discrete.
An SDE can be thought of as a time-dependent vector field that describes how much a data point is corrupted at any given time t as noise is introduced. Essentially, we want to characterise how the image changes as time progresses.
To achieve this, the SDE incorporates a diffusion term grounded in the concept of Brownian motion, where noise is introduced through a Wiener process dWt:

dxt = ut(xt)·dt + σt·dWt

where ut(xt) is the drift coefficient, σt is the diffusion coefficient, and dWt is the increment of a Wiener process (Brownian motion).

The first term is the drift term with coefficient ut(xt), which sets the deterministic velocity of the process — essentially describing where the sample is being pushed at time t. The second term is the diffusion term, which can be thought of as introducing small fluctuations along the path, so that rather than following a perfectly smooth trajectory, the process wobbles slightly before arriving at its destination.
When the diffusion term is removed and only the deterministic component remains — dxt = ut(xt)dt — the equation is known as an Ordinary Differential Equation (ODE).
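To make the drift/diffusion split concrete, here is a minimal Euler–Maruyama simulation of such an SDE (a toy sketch, not from the article's repository; the drift function and coefficients are illustrative assumptions):

```python
import torch

def euler_maruyama(x0, drift, sigma, n_steps, dt):
    # Simulate dx = drift(x, t) dt + sigma dW with Euler-Maruyama steps.
    x = x0.clone()
    trajectory = [x.clone()]
    for i in range(n_steps):
        t = i * dt
        # Deterministic drift plus a Brownian increment scaled by sqrt(dt).
        x = x + drift(x, t) * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
        trajectory.append(x.clone())
    return torch.stack(trajectory)

# Example: a drift that pulls samples toward the origin, with constant diffusion.
traj = euler_maruyama(torch.zeros(16, 2), lambda x, t: -x, 0.5, 100, 0.01)
```

Dropping the `sigma * ... * torch.randn_like(x)` term recovers a plain Euler integration of the corresponding ODE.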
Below is a visualisation of trajectories combining both the drift and diffusion terms:

Reverse Diffusion Process
The reverse process involves converting noisy data back into clean data. It begins with a noisy image at timestep T and progressively removes noise at each timestep until a clean image is recovered at t = 0:

pθ(xt−1 | xt) = N(xt−1; μθ(xt, t), σt²·I)

where the mean is parameterised through the predicted noise εθ(xt, t):

μθ(xt, t) = (1/√αt)·(xt − (βt/√(1 − ᾱt))·εθ(xt, t))

and the posterior variance is:

σt² = ((1 − ᾱt−1)/(1 − ᾱt))·βt

with αt = 1 − βt and ᾱt = α1·α2·…·αt.
Diffusion Models (DDPM): Training
The core idea of Denoising Diffusion Probabilistic Models (DDPM) is to use the forward and reverse diffusion processes to turn noisy data back into clean data.
This is done by training the model to predict the noise added at each timestep t and computing the Mean Squared Error (MSE) loss between the noise that was actually added and the noise the model predicts.
As such, the training objective is:

L(θ) = E_{t, x0, ε} [ ‖ε − εθ(xt, t)‖² ]

where ε ∼ N(0, I) is the injected noise and xt = √ᾱt·x0 + √(1 − ᾱt)·ε is the corrupted sample at timestep t.
Implementation
The example code below uses a synthetic Mixture of Gaussians (MoG) dataset. Each sample is generated by randomly selecting one of eight centres — (2, 0), (−2, 0), (0, 2), (0, −2), (2, 2), (−2, −2), (2, −2), and (−2, 2) — and adding Gaussian noise with a standard deviation of 0.25. This produces a simple 2D dataset with clear structure, making it ideal for visualising and understanding how a diffusion model learns to generate data.
For the backbone, a simple fully connected neural network predicts the noise added during the diffusion process.
def sample_mog(n, device):
    # Draw n samples from an 8-component Mixture of Gaussians (std 0.25 per mode).
    centers = torch.tensor(
        [
            [2.0, 0.0],
            [-2.0, 0.0],
            [0.0, 2.0],
            [0.0, -2.0],
            [2.0, 2.0],
            [-2.0, -2.0],
            [2.0, -2.0],
            [-2.0, 2.0],
        ],
        device=device,
    )
    idx = torch.randint(0, centers.size(0), (n,), device=device)
    x = centers[idx] + 0.25 * torch.randn(n, 2, device=device)
    return x


class NoiseNet(nn.Module):
    def __init__(self, dim, hidden, n_layers):
        super().__init__()
        layers = []
        in_dim = dim + 1  # input is the sample concatenated with the timestep t
        for _ in range(n_layers - 1):
            layers.append(nn.Linear(in_dim, hidden))
            layers.append(nn.SiLU())
            in_dim = hidden
        layers.append(nn.Linear(in_dim, dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x, t):
        t = t.view(-1, 1)
        return self.net(torch.cat([x, t], dim=1))


def train(cfg):
    set_seed(cfg.seed)
    os.makedirs(cfg.save_dir, exist_ok=True)
    model = NoiseNet(dim=2, hidden=cfg.hidden, n_layers=cfg.n_layers).to(cfg.device)
    opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    betas, alphas, alpha_bars = make_schedule(cfg, cfg.device)
    for step in range(1, cfg.steps + 1):
        x0 = sample_mog(cfg.batch_size, cfg.device)
        t_idx = torch.randint(0, cfg.t_steps, (cfg.batch_size,), device=cfg.device)
        t = (t_idx.float() + 1.0) / cfg.t_steps
        x_t, noise = q_sample(x0, t_idx, alpha_bars)  # corrupt x0 to timestep t
        noise_pred = model(x_t, t)
        loss = F.mse_loss(noise_pred, noise)  # MSE between true and predicted noise
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        if step % 500 == 0 or step == 1:
            print(f"step {step} loss {loss.item():.6f}")
    torch.save(model.state_dict(), os.path.join(cfg.save_dir, "noise_net.pt"))
    return model
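The training loop above calls two helpers, make_schedule and q_sample, that are not shown in the snippet. A minimal sketch consistent with the closed-form forward process could look like the following (the linear β schedule and its endpoints are common DDPM defaults, assumed here rather than taken from the article):

```python
import torch

def make_schedule(cfg, device):
    # Linear beta schedule from 1e-4 to 0.02 (a common DDPM default; assumed).
    betas = torch.linspace(1e-4, 0.02, cfg.t_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative product: alpha_bar_t
    return betas, alphas, alpha_bars

def q_sample(x0, t_idx, alpha_bars):
    # Closed-form forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    alpha_bar = alpha_bars[t_idx].view(-1, 1)
    noise = torch.randn_like(x0)
    x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise
```

Returning both the corrupted sample and the noise lets the trainer use the noise directly as the regression target.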
Diffusion Models (DDPM): Sampling
Once the model has been trained, we generate new data by sampling from the learned distribution. Below is a comparison of samples before and after training the model.


The trained model progressively transforms samples from the source to target distribution at each time step, whereas untrained models leave samples scattered and unable to converge.
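The denoising loop itself can be sketched as a minimal ancestral sampler built from the trained ε-prediction network (σt² = βt is one common choice, assumed here for simplicity):

```python
import torch

@torch.no_grad()
def sample_ddpm(model, betas, alphas, alpha_bars, n, t_steps, device):
    # Start from pure Gaussian noise and denoise step by step down to t = 0.
    x = torch.randn(n, 2, device=device)
    for t_idx in reversed(range(t_steps)):
        t = torch.full((n,), (t_idx + 1.0) / t_steps, device=device)
        eps_pred = model(x, t)
        # Posterior mean: remove the predicted noise contribution, then rescale.
        mean = (x - betas[t_idx] / torch.sqrt(1.0 - alpha_bars[t_idx]) * eps_pred) \
            / torch.sqrt(alphas[t_idx])
        if t_idx > 0:
            x = mean + torch.sqrt(betas[t_idx]) * torch.randn_like(x)
        else:
            x = mean  # no noise is added at the final step
    return x
```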
Score-Based Diffusion Models
Unlike DDPM, which learns to predict noise, score-based methods learn the score function, the gradient of the log-density with respect to x:

s(x) = ∇x log p(x)

This vector field points toward regions of higher probability at each noise level. Given the score, one can solve the reverse SDE or run Langevin dynamics to generate samples.
Score Function in the Diffusion
Under the forward process, the conditional distribution is:

q(xt | x0) = N(xt; √ᾱt·x0, (1 − ᾱt)·I)

Taking the gradient ∇xt of the log of q(xt | x0):

∇xt log q(xt | x0) = −(xt − √ᾱt·x0) / (1 − ᾱt)

Substituting xt with:

xt = √ᾱt·x0 + √(1 − ᾱt)·ε,  ε ∼ N(0, I)

it simplifies to:

∇xt log q(xt | x0) = −ε / √(1 − ᾱt)
Score-Based Diffusion Models: Training

Rather than predicting noise directly, score-based diffusion models learn to predict the score, the gradient of the log probability density. The training objective for a network sθ(xt, t) is denoising score matching:

L(θ) = E_{t, x0, ε} [ ‖sθ(xt, t) − ∇xt log q(xt | x0)‖² ]

This is equivalent (up to a per-timestep scaling factor) to the DDPM objective: it minimises the MSE between the predicted score and the true score at each timestep t.
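The score/noise relation is easy to verify numerically (a quick sanity check, not from the article): substituting xt = √ᾱt·x0 + √(1 − ᾱt)·ε into the analytic score yields −ε/√(1 − ᾱt).

```python
import torch

alpha_bar = torch.tensor(0.7)  # example value of alpha_bar_t
x0 = torch.randn(4, 2)         # clean data
eps = torch.randn(4, 2)        # injected noise
x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * eps

score_analytic = -(x_t - torch.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)
score_from_eps = -eps / torch.sqrt(1.0 - alpha_bar)
assert torch.allclose(score_analytic, score_from_eps, atol=1e-6)
```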
Reverse Process via Score
The reverse step adds the scaled score (ascending the log-density) rather than subtracting predicted noise:

xt−1 = (1/√αt)·(xt + βt·sθ(xt, t)) + σt·z,  z ∼ N(0, I)
Implementation
def score_target(x_t, x0, t_idx, alpha_bars):
    # Analytic score of q(x_t | x0): -(x_t - sqrt(abar) * x0) / (1 - abar)
    alpha_bar = alpha_bars[t_idx].view(-1, 1)
    return -(x_t - torch.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)


def train(cfg):
    set_seed(cfg.seed)
    os.makedirs(cfg.save_dir, exist_ok=True)
    model = ScoreNet(dim=2, hidden=cfg.hidden, n_layers=cfg.n_layers).to(cfg.device)
    opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    betas, alphas, alpha_bars = make_schedule(cfg, cfg.device)
    for step in range(1, cfg.steps + 1):
        x0 = sample_mog(cfg.batch_size, cfg.device)
        t_idx = torch.randint(0, cfg.t_steps, (cfg.batch_size,), device=cfg.device)
        t = (t_idx.float() + 1.0) / cfg.t_steps
        x_t, _ = q_sample(x0, t_idx, alpha_bars)  # q_sample returns (x_t, noise); only x_t is needed
        s_target = score_target(x_t, x0, t_idx, alpha_bars)
        s_pred = model(x_t, t)
        loss = F.mse_loss(s_pred, s_target)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        if step % 500 == 0 or step == 1:
            print(f"step {step} loss {loss.item():.6f}")
    torch.save(model.state_dict(), os.path.join(cfg.save_dir, "score_net.pt"))
    return model
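A sampler built on the learned score can be sketched as an ancestral loop that adds the scaled score at each step (a minimal sketch; σt² = βt is assumed, as in the DDPM case):

```python
import torch

@torch.no_grad()
def sample_score(model, betas, alphas, n, t_steps, device):
    x = torch.randn(n, 2, device=device)
    for t_idx in reversed(range(t_steps)):
        t = torch.full((n,), (t_idx + 1.0) / t_steps, device=device)
        score = model(x, t)
        # Ascend the log-density: mean = (x_t + beta_t * score) / sqrt(alpha_t)
        mean = (x + betas[t_idx] * score) / torch.sqrt(alphas[t_idx])
        if t_idx > 0:
            x = mean + torch.sqrt(betas[t_idx]) * torch.randn_like(x)
        else:
            x = mean
    return x
```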
Score-Based Diffusion Models: Sampling


Flow Matching
Flow matching takes a different approach altogether by bypassing iterative denoising. Instead, it learns a velocity field that moves samples from a noise distribution to a data distribution along straight paths, a cleaner objective rooted in optimal transport theory.

The difference between the diffusion process and flow matching is that flow matching has no stochastic term; it has only a drift term. As such, the trajectory is governed entirely by an ordinary differential equation that defines the velocity field at each timestep t:

dxt = vθ(xt, t)·dt
Conditional Flow Matching
To train the model to predict the velocity at each timestep t, we need an expected velocity as ground truth, so the model knows what the straight-line trajectory at each timestep t should look like.
To do so, we define the straight-line interpolation between noise x0 ∼ N(0, I) and data x1 ∼ pdata, with t ranging from 0 to 1:

xt = (1 − t)·x0 + t·x1

Differentiating xt with respect to t yields the velocity v:

v = dxt/dt = x1 − x0
Flow-Matching Models: Training
A velocity network vθ(xt, t) is trained to match the constant target velocity:

L(θ) = E_{t, x0, x1} [ ‖vθ(xt, t) − (x1 − x0)‖² ]
Similarly to diffusion models (DDPM and score-based), the training objective of flow matching models is to compute the MSE loss between the predicted velocity and the ground truth.
By conditioning on matched pairs (x0, x1), the training signal has low variance. This conditional objective represents a tight upper bound on the marginal flow matching loss.
Implementation
def cfm_velocity(x0, x1, t):
    # Straight-line interpolation x_t = (1-t) x0 + t x1 and its constant velocity.
    t = t.view(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return x_t, v


def train(cfg):
    set_seed(cfg.seed)
    os.makedirs(cfg.save_dir, exist_ok=True)
    model = VelocityNet(dim=2, hidden=cfg.hidden, n_layers=cfg.n_layers).to(cfg.device)
    opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    for step in range(1, cfg.steps + 1):
        x1 = sample_mog(cfg.batch_size, cfg.device)   # data samples
        x0 = sample_base(cfg.batch_size, cfg.device)  # noise samples
        t = torch.rand(cfg.batch_size, device=cfg.device)
        x_t, v_target = cfm_velocity(x0, x1, t)
        v_pred = model(x_t, t)
        loss = F.mse_loss(v_pred, v_target)
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        if step % 500 == 0 or step == 1:
            print(f"step {step} loss {loss.item():.6f}")
    torch.save(model.state_dict(), os.path.join(cfg.save_dir, "velocity_net.pt"))
    return model
Flow-Matching Models: Sampling
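Sampling reduces to integrating the learned ODE from t = 0 (noise) to t = 1 (data). A minimal fixed-step Euler integrator might look like this (a sketch; higher-order solvers can also be used):

```python
import torch

@torch.no_grad()
def sample_flow(model, n, n_steps, device):
    # Euler integration of dx/dt = v_theta(x, t) from noise (t=0) to data (t=1).
    x = torch.randn(n, 2, device=device)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((n,), i * dt, device=device)
        x = x + model(x, t) * dt
    return x
```

Because the learned trajectories are near-straight, even a small n_steps often suffices, which is the source of flow matching's sampling speed.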


Why do untrained diffusion and flow matching models fail so differently?
Comparing untrained Diffusion models (DDPM and Score-based) and Flow Matching models exposes failure modes that reflect the mathematics underlying each approach.
Diffusion models are built on a Stochastic Differential Equation, a process driven by adding randomness at every timestep. Without a trained denoising network to counteract that noise, samples accumulate error progressively, dispersing into incoherence across the full space.
Flow Matching, by contrast, is governed by an Ordinary Differential Equation, which is inherently deterministic with no noise injection. Without training, the velocity field carries no signal, so samples simply remain near where they start, effectively frozen.
This asymmetry is the key insight. Diffusion fails actively as its stochasticity compounds into corruption, while Flow Matching fails passively, its determinism producing stagnation (not moving) rather than chaos.
The same properties explain their trained behaviour. A learned Flow Matching model follows smooth and direct paths to the target distribution, while Diffusion models must take many small steps to counteract accumulated noise — making it both slower and more sensitive to the quality of each denoising step.
Energy-Based Models with Contrastive Divergence
Unlike models that learn a normalised probability, whose values are bounded within a fixed range such as 0–1, Energy-Based Models (EBMs) define an unnormalised probability density through a learned energy function Eθ : R^d → R, meaning the output is unbounded:

pθ(x) = exp(−Eθ(x)) / Zθ,  where Zθ = ∫ exp(−Eθ(x)) dx

Here Zθ is the partition function, the integral of the exponentiated negative energy over all possible states.
The partition function Zθ is generally intractable, as integrating over all possible configurations is computationally infeasible, and this is the central challenge of training EBMs.
What the model does capture is a relative ordering: low energy corresponds to high probability, reflecting regions where data is dense, while high energy corresponds to low probability.
Langevin Dynamics for Sampling
Langevin dynamics plays a central role in training EBMs, as contrastive divergence requires sampling from the model distribution (pθ) during training.
To understand how this sampling works, it is important to first understand Markov Chain Monte Carlo (MCMC).
What is Markov Chain Monte Carlo (MCMC)?
MCMC is a class of sampling algorithms used to draw samples from distributions that cannot be sampled from directly.
It works by constructing a Markov chain — a sequence of samples where each sample depends only on the previous one — such that the chain’s stationary distribution matches the target distribution. By running the chain long enough, the produced samples are effectively drawn from the target distribution.
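As a toy illustration (not from the article), a Langevin-style chain targeting a standard normal, where E(x) = x²/2 and ∇E(x) = x, produces samples with roughly zero mean and unit variance once it has mixed:

```python
import torch

def langevin_chain(n_steps, step_size, n_chains):
    # Langevin update: x <- x - (eps/2) * grad E(x) + sqrt(eps) * noise
    x = torch.zeros(n_chains)
    for _ in range(n_steps):
        grad_e = x  # for E(x) = x^2 / 2, the gradient is x itself
        x = x - 0.5 * step_size * grad_e + (step_size ** 0.5) * torch.randn(n_chains)
    return x

samples = langevin_chain(n_steps=2000, step_size=0.1, n_chains=5000)
```

After enough steps, the empirical mean and standard deviation approach 0 and 1 (up to a small discretization bias that shrinks with the step size).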
Sampling in Energy-based Models (EBMs)

xk+1 = xk − (ε/2)·∇x Eθ(xk) + √ε·ηk

The equation above is a discretized form of Langevin dynamics, and it neatly combines two forces: deterministic optimization and stochastic exploration.
Starting from the current state xk, the second term moves the sample in the direction of lower energy by following the negative gradient of the energy function ∇xEθ(xk).
Since lower energy corresponds to higher probability under the model,

pθ(x) ∝ exp(−Eθ(x)),
this step nudges the sample toward more likely regions of the data distribution — much like rolling downhill on an energy landscape.
However, relying on this term alone would quickly trap the process in local minima. That is where the third term comes in: a Gaussian noise component √ε·ηk, where ηk ∼ N(0, I) is a random vector drawn at each step k from a standard multivariate normal distribution (zero mean, identity covariance), scaled by √ε.
This injects unbiased randomness at every step and allows the chain to explore the space more broadly.
Together, these two terms create a balance between exploitation and exploration.
Under mild regularity conditions and with a sufficiently small step size ε, the Markov chain defined by this update converges to the target distribution pθ.
In other words, after enough iterations, the iterates xk behave as if they were drawn from the model itself. However, the choice of ε is critical.
If it is too large, the discretization becomes unstable — the updates may overshoot, and the chain can diverge instead of settling into the desired distribution. On the other hand, if ε is too small, each step makes only minimal progress, leading to extremely slow mixing and inefficient sampling.
This delicate trade-off is why practical implementations often rely on carefully tuned step sizes or annealing strategies, gradually reducing ε over time to first encourage exploration and then refine samples in high-probability regions.
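An annealed schedule can be sketched as follows (a hypothetical helper, not from the article's repository; the geometric decay rate is an illustrative choice):

```python
import torch

def annealed_langevin(grad_energy, x, step_sizes):
    # Run Langevin updates while decaying the step size: explore first, refine later.
    for eps in step_sizes:
        x = x - 0.5 * eps * grad_energy(x) + (eps ** 0.5) * torch.randn_like(x)
    return x

# Example schedule: geometric decay from 0.1 down over 200 steps.
schedule = [0.1 * (0.98 ** k) for k in range(200)]
```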
Energy-based Models: Training with Contrastive Divergence

Since the partition function Zθ is intractable, directly maximizing log pθ(x) is infeasible. Contrastive Divergence therefore approximates the gradient by pushing energy down on real data points and pushing energy up on samples drawn from the model using K-step Langevin dynamics:

L_CD(θ) = E_{x⁺ ∼ pdata} [Eθ(x⁺)] − E_{x⁻ ∼ pθᴷ} [Eθ(x⁻)]

where pθᴷ denotes the distribution of samples after K Langevin steps.

The key idea of the contrastive loss is to shape the energy landscape by pushing it down at ground truth data points (pdata) and pushing it up at model samples (pθ), which are not real data. This loss is then used to update the model.
The second term samples from pθ of step K, where θ represents the current model weights during training. Since the model is still being trained, the samples drawn from pθ are not real data, which means they are what the model currently believes the data should look like.
As K → ∞, the gradient of this objective converges to an unbiased estimator of the log-likelihood gradient. In practice, CD-1 is computationally cheap but introduces bias, while K ∈ [20, 100] yields better approximations at greater cost.
Why do we need sampling during training in Energy-based models?
Energy-Based Models (EBMs) model an unnormalised distribution through a contrastive process: pushing energy down for real data while pushing energy up for samples drawn from the model (contrastive divergence). Since the distribution is unnormalised, there is no closed-form way to draw those model samples directly, which is why MCMC is required during training; note also that samples from an undertrained model will not yet match the true distribution.
In summary, the training gradient in EBMs has two terms:
Real Data → pushes energy down
Model samples → push energy up
MCMC is needed to obtain those model samples because there is no closed-form way to sample directly from pθ.
Implementation
def sample_mog(n, device):
    # Draw n samples from the same 8-component Mixture of Gaussians as before.
    centers = torch.tensor(
        [
            [2.0, 0.0],
            [-2.0, 0.0],
            [0.0, 2.0],
            [0.0, -2.0],
            [2.0, 2.0],
            [-2.0, -2.0],
            [2.0, -2.0],
            [-2.0, 2.0],
        ],
        device=device,
    )
    idx = torch.randint(0, centers.size(0), (n,), device=device)
    x = centers[idx] + 0.25 * torch.randn(n, 2, device=device)
    return x


def langevin_step(x, model, step_size, noise_scale):
    # One Langevin update: descend the energy gradient, then inject Gaussian noise.
    x = x.detach().requires_grad_(True)
    energy = model(x).sum()
    grad = torch.autograd.grad(energy, x)[0]
    x = x - 0.5 * step_size * grad
    x = x + noise_scale * torch.randn_like(x)
    return x.detach()


def sample_negative(model, cfg, init=None):
    # K-step Langevin chain producing "negative" samples from the current model.
    if init is None:
        x = torch.randn(cfg.batch_size, 2, device=cfg.device)
    else:
        x = init
    for _ in range(cfg.cd_k):
        x = langevin_step(x, model, cfg.langevin_step_size, cfg.langevin_noise_scale)
    return x


def train(cfg):
    set_seed(cfg.seed)
    os.makedirs(cfg.save_dir, exist_ok=True)
    model = EnergyNet(dim=2, hidden=cfg.hidden, n_layers=cfg.n_layers).to(cfg.device)
    opt = torch.optim.Adam(model.parameters(), lr=cfg.lr)
    for step in range(1, cfg.steps + 1):
        x_pos = sample_mog(cfg.batch_size, cfg.device)  # real data
        x_neg = sample_negative(model, cfg)             # model samples
        energy_pos = model(x_pos).mean()
        energy_neg = model(x_neg).mean()
        loss = energy_pos - energy_neg  # push energy down on real, up on fake
        opt.zero_grad(set_to_none=True)
        loss.backward()
        opt.step()
        if step % 500 == 0 or step == 1:
            print(f"step {step} loss {loss.item():.6f}")
    torch.save(model.state_dict(), os.path.join(cfg.save_dir, "energy_net.pt"))
    return model
Energy-Based Models: Sampling


From the visualization above, the yellow gradient represents high energy (low probability), while the dark blue gradient represents low energy (high probability). The light blue cluster dots are Gaussian noise, and the red dots represent real data. In the untrained model, energy remains high around the real data regions, preventing noise samples from converging to the actual distribution. However, after training, the model successfully guides noise samples to the real data distribution where energy is low.
Repository
A working implementation of the concepts covered in this article can be found in the generative-modelling repository.