AI & ML Advanced By Samson Tanimawo, PhD Published Dec 24, 2026 5 min read

Diffusion Models for Images

Image generation runs on diffusion: a model that learns to reverse a noise-adding process. The math is elegant; the engineering is what makes it work at production quality.

The diffusion idea

Diffusion models generate images by iteratively denoising from random noise. To train, start with a real image, add noise step by step until only noise remains, and teach the model to predict the noise at each step. To generate, start with pure noise and reverse the process: the model removes noise step by step until a coherent image emerges. The framework underlies Stable Diffusion, DALL-E 3, Midjourney v6, and most state-of-the-art image generators.

The forward process. Start with image x_0. Add Gaussian noise: x_1 = noisy version, x_2 = noisier, ..., x_T = pure noise. The forward process is fixed: it is fully determined by the noise schedule and has no learned parameters. The model isn't trained on this; it's just the data preparation.
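The forward process can be sketched in a few lines. This is a minimal illustration, assuming a DDPM-style linear schedule and the closed-form jump x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps; the schedule endpoints, function names, and the four-value stand-in "image" are assumptions for illustration, not something this article specifies.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, rising linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_signal(betas):
    """alpha_bar[t]: fraction of the original signal variance retained at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t without simulating the steps in between."""
    signal = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [signal * v + noise_scale * e for v, e in zip(x0, eps)]
    return x_t, eps  # eps is what the noise predictor is trained to recover

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_signal(betas)
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]                      # stand-in "image" of four values
x_mid, _ = add_noise(x0, 499, alpha_bars, rng)   # partly noised
x_last, _ = add_noise(x0, 999, alpha_bars, rng)  # almost pure noise
```

With these assumed settings, alpha_bars shrinks towards zero by the final step, which is why x_T carries almost no trace of the original image.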

The reverse process. Train the model to predict the noise at each timestep given the noisy image. At inference: start with noise, predict the noise, remove a portion of it, and repeat. The accumulated small corrections converge to a clean image.
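A toy version of the sampling loop makes the "predict, remove, repeat" cycle concrete. This sketch assumes DDPM-style ancestral sampling and replaces the trained network with a hypothetical oracle_noise function that knows the answer exactly, because the toy data distribution is a single number. Real samplers differ in step count and update rule.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule and its cumulative signal fractions."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + i * step for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_noise(x_t, t, alpha_bars, target):
    """Stand-in for the trained network: exact noise when all data equals target."""
    return (x_t - math.sqrt(alpha_bars[t]) * target) / math.sqrt(1.0 - alpha_bars[t])

def sample(num_steps, target, rng):
    betas, alpha_bars = make_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for t in reversed(range(num_steps)):
        eps_hat = oracle_noise(x, t, alpha_bars, target)
        # Remove this step's share of the predicted noise, then rescale.
        mean = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(1.0 - betas[t])
        # Every step except the last re-injects a little fresh noise.
        fresh = rng.gauss(0.0, 1.0) if t > 0 else 0.0
        x = mean + math.sqrt(betas[t]) * fresh
    return x

result = sample(num_steps=1000, target=0.7, rng=random.Random(0))
```

Because the oracle is perfect, the loop lands on the target value; a learned predictor only approximates this, which is where sample quality is won or lost.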

The "why does this work" intuition. The model learns to gradually move the noisy image closer to the data distribution. Each denoising step is a small move; many steps produce a long trajectory to a clean image. The model is not meant to memorise images; it learns the structure of the data distribution and samples from it.

The improvements over GANs. Diffusion models are easier to train than GANs: there are no adversarial dynamics, and they are far less prone to mode collapse. Quality is now competitive or better, and most image generation has shifted from GAN-based to diffusion-based as a result.

Architectures

Two main architectural variants. Pixel diffusion: model operates directly on pixels. Latent diffusion: model operates on a compressed latent representation; a separate VAE encodes/decodes between pixels and latents. Latent diffusion is much more compute-efficient (smaller representation = less compute per step) and is the dominant approach in modern systems.

The latent advantage. A 512×512 RGB image has about 262K pixels, or 786K values once the three colour channels are counted. The same image's latent might be 64×64×4, about 16K values. The model therefore works on 16x fewer values than the image has pixels, and 48x fewer than it has channel values, which cuts compute per step substantially. Latent diffusion is the breakthrough that made high-resolution image generation tractable.
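The size comparison is simple arithmetic, worked through here. The 64×64×4 latent shape comes from the paragraph above; the 8x spatial downsampling and the three-channel RGB input are assumptions used to reach it.

```python
def tensor_size(height, width, channels):
    """Number of values in a height x width x channels tensor."""
    return height * width * channels

pixel_count = 512 * 512                      # spatial positions in the image
pixel_values = tensor_size(512, 512, 3)      # RGB values the model would process
latent_values = tensor_size(64, 64, 4)       # latent from an assumed 8x-downsampling VAE

pixels_vs_latent = pixel_count / latent_values       # 16.0
values_vs_latent = pixel_values / latent_values      # 48.0
positions_ratio = pixel_count / (64 * 64)            # 64.0

print(pixel_values, latent_values, pixels_vs_latent, values_vs_latent, positions_ratio)
# prints: 786432 16384 16.0 48.0 64.0
```

Actual compute savings depend on the architecture, since attention cost grows faster than linearly with the number of spatial positions, so these ratios are a rough guide rather than a speedup figure.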

The U-Net backbone. Most diffusion models use U-Net architectures: convolutional encoders that downsample, then decoders that upsample, with skip connections between matching resolutions. U-Nets are good at preserving spatial structure while processing multi-scale features. Recent models add transformer blocks, or switch to them entirely (DiT, Diffusion Transformers), because transformers scale better.

The cross-attention for text. To generate from text prompts, models add cross-attention layers that condition image generation on text embeddings. The text encoder (CLIP, T5) produces embeddings; the diffusion model attends to them at each step. The text-image alignment is what makes prompt-following work.
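Cross-attention itself is a small computation: each image position scores every text token, turns the scores into weights, and takes a weighted mix of the text vectors. This is a minimal sketch of scaled dot-product attention that omits the learned query, key, and value projections a real model applies first; all vectors are made-up illustrative values.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(image_queries, text_keys, text_values):
    """Each image position takes a weighted average of the text value vectors."""
    scale = 1.0 / math.sqrt(len(text_keys[0]))
    outputs, all_weights = [], []
    for query in image_queries:
        weights = softmax([dot(query, key) * scale for key in text_keys])
        mixed = [sum(w * value[d] for w, value in zip(weights, text_values))
                 for d in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Two image positions, three text tokens, 2-dim vectors (illustrative only).
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
text_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = cross_attention(image_queries, text_keys, text_values)
```

Here the first image position attends mostly to the first token and the second position to the second token, which is the mechanism that lets different regions of the image follow different words in the prompt.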

The DiT shift. Diffusion Transformers replace the U-Net with a pure-transformer architecture. They scale better and produce higher quality at large parameter counts. State-of-the-art models in 2026 (SD3, Imagen 3, Sora's image stages) increasingly use DiT or DiT-derived architectures.

Production models

The 2026 lineup:

Stable Diffusion 3 (Medium and Large): open weights; customisable; dominant for self-hosted use.
Flux.1: open weights from Black Forest Labs; high quality; LoRA-friendly.
DALL-E 3: OpenAI's; tight prompt following; integrated with ChatGPT.
Imagen 3: Google's; high quality; strong text rendering.
Midjourney v6: closed; aesthetic quality leader; Discord-based UX.

The open vs closed split. Open weights (SD3, Flux): self-host, fine-tune, customise. Closed (DALL-E, Imagen, Midjourney): turnkey API; often higher out-of-the-box quality, but little or no customisation. The choice depends on whether you need customisation; for many production uses, closed APIs are simpler.

The fine-tuning / LoRA story. Open-weights models support fine-tuning for specific styles, characters, or domains. LoRA-based fine-tuning is cheap (a few hundred dollars per model). The customisation is what made Stable Diffusion the default for art-generation companies.
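The reason LoRA is cheap shows up in the parameter count. Instead of updating a full weight matrix W, LoRA trains two thin matrices B and A and uses W + B·A. This is a minimal sketch with hypothetical layer sizes and rank; the function names are illustrative and not taken from any library.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tune vs. a rank-`rank` LoRA adapter."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)
    return full, lora

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def apply_lora(weight, lora_b, lora_a, scale=1.0):
    """Effective weight = frozen W + scale * (B @ A)."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Hypothetical 1024 x 1024 layer with a rank-8 adapter.
full, lora = lora_param_counts(d_out=1024, d_in=1024, rank=8)
print(full, lora, full // lora)  # prints: 1048576 16384 64

# Tiny worked example: a 3 x 3 frozen weight with a rank-1 update.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lora_b = [[1.0], [0.0], [2.0]]   # 3 x 1
lora_a = [[0.5, 0.0, -0.5]]      # 1 x 3
adapted = apply_lora(weight, lora_b, lora_a)
```

Only B and A receive gradients during training, so the adapter is small enough to train on modest hardware and to ship as a separate file alongside the frozen base model.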

The text rendering capability. Generating images with embedded text (signs, labels, titles) was a known weakness. 2024-2026 models substantially improved here; Imagen 3, SD3 Large, and DALL-E 3 all handle text reasonably. Specialised text-rendering still helps for high-quality typography.

The aesthetic-quality stratification. Midjourney leads on out-of-the-box aesthetic quality. SD3 Large and Flux are competitive but require more prompting skill. DALL-E 3 has the tightest prompt following. Match model to use case: tight prompts → DALL-E; aesthetic punch → Midjourney; customisation → SD3/Flux.

Practical controls

Beyond text prompts, modern diffusion models support:

ControlNet: condition on edge maps, depth, pose, or scribbles. Powerful for guided generation.
IP-Adapter: condition on reference images for style or composition.
Inpainting: fill in masked regions while preserving context.
Outpainting: extend images beyond their borders.
Image-to-image: start from an existing image and modify it.

The ControlNet revolution. Vanilla text-to-image is "tell the model what to draw". ControlNet adds "tell the model what shape to draw it in". Pose control, depth control, edge control all let you specify composition while letting the model handle aesthetics. Production design workflows lean heavily on ControlNet.

The IP-Adapter pattern. Provide reference images; the model conditions on their style or composition. Useful for "make this image but in the style of these references". Fast; doesn't require fine-tuning. Dominates style-transfer use cases.

The inpainting workflow. Mask a region; the model regenerates only that region while preserving context. Production photo editing, object removal, and content aging are all built on inpainting. Quality has improved enough that "Photoshop-quality edits" via inpainting are reachable.
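One common way to guarantee that unmasked content stays untouched is to composite the generated result with the original using the mask. The sketch below shows only that blend on a tiny 2×3 grid; dedicated inpainting models also feed the mask to the network, and the values here are illustrative.

```python
def composite(original, generated, mask):
    """Keep original values where mask is 0; take generated values where mask is 1."""
    return [[m * g + (1.0 - m) * o for o, g, m in zip(o_row, g_row, m_row)]
            for o_row, g_row, m_row in zip(original, generated, mask)]

original = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
generated = [[0.9, 0.9, 0.9], [0.9, 0.9, 0.9]]
mask = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]   # 1 marks the region to regenerate
result = composite(original, generated, mask)
print(result)  # prints: [[0.1, 0.9, 0.3], [0.4, 0.9, 0.9]]
```

Fractional mask values between 0 and 1 feather the edge of the edited region, which helps hide seams.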

The img2img variation. Start from an existing image and apply a text prompt as the modification. A strength parameter controls how much of the original is preserved: in most tools, low values stay close to the input and high values allow larger changes. Useful for style transfer, variations, and iterative refinement. The compose-by-iteration UX is increasingly dominant.
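Under the hood, strength usually decides how far into the noise schedule the input image is pushed before denoising begins. The sketch below follows a convention used by several open-source pipelines, where strength is the fraction of denoising steps actually run; the function name and return shape are hypothetical.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (index of first step to run, number of steps run) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    start_index = num_inference_steps - steps_to_run
    return start_index, steps_to_run

# With 40 steps and strength 0.25, only the last 10 steps run,
# so most of the original image survives.
print(img2img_schedule(40, 0.25))  # prints: (30, 10)

# Strength 1.0 runs every step and behaves like generating from scratch.
print(img2img_schedule(40, 1.0))   # prints: (0, 40)
```

A side effect of this convention is that low strength is also faster, since fewer denoising steps are executed.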

Common antipatterns

Generating from scratch when you should use img2img. Iterative refinement converges faster than re-rolling. Use the previous output as input.

Skipping ControlNet for layout-sensitive use. If you need specific composition, ControlNet gets you there faster than prompt engineering.

Using a closed API for high-volume customised generation. Costs add up; open weights with LoRA usually offer better economics at scale.

Believing benchmark scores for aesthetic quality. Aesthetic quality is subjective; benchmarks miss the point. User testing on YOUR use case is what matters.

What to do this week

Three moves. (1) For one image generation use case, prototype with two different models (e.g., DALL-E 3 + SD3). The quality differences for YOUR use case will surprise you. (2) If your use case has consistent layout/composition needs, build a ControlNet pipeline. The control over composition is transformative for product workflows. (3) If you're at high volume, model the costs of closed-API and self-hosted open-weights options. The crossover for image generation is typically around 5K-50K images/day, depending on pricing and GPU utilisation.
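A first-pass cost model for move (3) can be a single break-even formula. Every number below is a placeholder rather than a real price quote; substitute your own API pricing, daily infrastructure and staffing cost, and marginal per-image cost.

```python
def breakeven_images_per_day(api_price_per_image, fixed_cost_per_day,
                             self_hosted_cost_per_image):
    """Daily volume at which self-hosting costs the same as the API."""
    saving_per_image = api_price_per_image - self_hosted_cost_per_image
    if saving_per_image <= 0:
        return None  # self-hosting never catches up at these prices
    return fixed_cost_per_day / saving_per_image

# Placeholder inputs: $0.04 per API image, $300/day fixed self-hosting cost
# (GPUs plus maintenance), $0.01 marginal cost per self-hosted image.
volume = breakeven_images_per_day(
    api_price_per_image=0.04,
    fixed_cost_per_day=300.0,
    self_hosted_cost_per_image=0.01,
)
print(round(volume))  # about 10,000 images/day with these placeholder numbers
```

Below the break-even volume the API is cheaper; above it, the fixed cost of self-hosting is spread over enough images to win.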