Diffusion Models for Images
Image generation runs on diffusion: a model that learns to reverse a noise-adding process. The math is elegant; the engineering is what makes it work at production quality.
The diffusion idea
Take an image. Repeatedly add small amounts of Gaussian noise until it’s pure noise. Train a model to predict the noise added at each step. To generate: start from noise, repeatedly subtract predicted noise, end up with an image.
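The forward (noising) process has a convenient closed form: you can jump straight from the clean image x_0 to any noised step x_t, which is what makes training cheap. A minimal NumPy sketch, using a toy linear noise schedule (the schedule values are illustrative, not a tuned production schedule):

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # toy linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)         # cumulative signal fraction at step t

def add_noise(x0, t):
    """Sample x_t directly from x_0 in one shot (closed form)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps  # eps is the training target the model learns to predict

x0 = rng.standard_normal((8, 8))       # stand-in for an image
xt, eps = add_noise(x0, t=T - 1)
print(xt.shape)                        # (8, 8)
```

By the final step `alpha_bar[-1]` is tiny, so x_T is essentially pure noise: the starting point for generation.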
Conditional generation: condition the noise-prediction on a text prompt embedding. The model learns to produce images that match descriptions.
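The training objective with conditioning is just noise-prediction MSE where the model also sees a text embedding. A toy sketch with a single linear map standing in for the network (real models use a U-Net or transformer with cross-attention; this only shows the data flow, and all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def predict_noise(x_t, text_emb, W):
    # Toy conditioning: concatenate the noisy image with the text embedding.
    inp = np.concatenate([x_t, text_emb])
    return W @ inp                      # predicted noise, same size as x_t

d_img, d_txt = 16, 8
W = rng.standard_normal((d_img, d_img + d_txt)) * 0.1

x0 = rng.standard_normal(d_img)
eps = rng.standard_normal(d_img)
x_t = 0.7 * x0 + 0.3 * eps              # pretend this is a noised image
text_emb = rng.standard_normal(d_txt)   # pretend CLIP/T5 prompt embedding

# Training loss: how well did we recover the noise, given the prompt?
loss = np.mean((predict_noise(x_t, text_emb, W) - eps) ** 2)
print(loss >= 0.0)                      # True
```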
Architectures
- U-Net: the original Stable Diffusion (and SDXL) backbone. Convolutional encoder/decoder with skip connections.
- DiT (Diffusion Transformer): replace the U-Net with a transformer. Scales better with compute; powers Stable Diffusion 3, Flux, and most 2024-2025 frontier image models.
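To feed an image into a transformer, a DiT first "patchifies" it: the image (or latent) is split into non-overlapping P×P patches, each flattened into a token. A NumPy sketch with toy sizes (a 32×32×4 array and patch size 2, chosen only for illustration):

```python
import numpy as np

def patchify(img, P):
    """Split an H×W×C array into flattened P×P patches (tokens)."""
    H, W, C = img.shape
    patches = img.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)        # (num_tokens, token_dim)

img = np.arange(32 * 32 * 4, dtype=np.float32).reshape(32, 32, 4)
tokens = patchify(img, P=2)
print(tokens.shape)   # (256, 16): a 16×16 grid of tokens, 2·2·4 values each
```

The transformer then runs standard self-attention over these tokens, which is why the architecture inherits the scaling behavior of language models.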
Production models
- Stable Diffusion 3: open weights, strong text rendering.
- Flux.1 (Pro/Dev/Schnell): Black Forest Labs' model. Open-weight Dev and Schnell variants; stunning quality.
- Midjourney v6/v7: closed, strongest aesthetic quality.
- DALL·E 3 / Imagen 3: closed, integrated with chat models.
Practical controls
- Classifier-free guidance: dial how strictly the model adheres to the prompt. Higher = more literal, less diverse.
- Sampler: Euler-A, DPM++ 2M, etc. Different speed/quality tradeoffs.
- Steps: 20-50 typical. More steps = diminishing quality gains at linearly increasing cost.
- LoRAs and ControlNets: LoRAs fine-tune the model's weights toward a style or subject; ControlNets condition generation on structural inputs like pose or depth maps.
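Classifier-free guidance is simple at sampling time: run the noise predictor twice, once with the prompt and once without, then extrapolate along the difference. A sketch where `eps_cond` and `eps_uncond` stand in for the two model outputs (the values are illustrative):

```python
import numpy as np

def cfg(eps_uncond, eps_cond, guidance_scale):
    # scale = 1.0 -> the conditional prediction unchanged; higher scales
    # push the sample further toward the prompt direction, reducing diversity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_uncond = np.zeros(4)   # stand-in: model output with an empty prompt
eps_cond = np.ones(4)      # stand-in: model output with the real prompt
print(cfg(eps_uncond, eps_cond, 7.5))   # each element: 0 + 7.5*(1 - 0) = 7.5
```

This is why guidance roughly doubles the compute per step: two forward passes instead of one.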