Diffusion models are the underlying technology behind state-of-the-art image generation systems like Stable Diffusion, DALL-E 2, and Google Research's Imagen. In this post, we will discuss how diffusion works by framing image generation as a denoising process.
Diffusion models train on corrupted images where noise of different magnitudes has been added. This training data helps the model learn to predict and remove noise at different levels, gradually revealing the underlying image. During generation, the model starts with an image of pure noise and iteratively predicts and removes slices of noise, transforming that noise into a realistic image that aligns with the distribution of images in its training data.
We will explore this process, looking at how a diffusion model is trained to predict noise and how that trained model can then be used to generate novel images by gradually removing it. This will give us insight into the key components of diffusion-based image generation systems, such as the text encoder that conditions the generated images on a text prompt, and the overall training procedure that makes these systems work.
Also known as "checkpoint", it's the results of training, usually distributed as a single file
Household name for training a model
Training scripts and other Stable Diffusion-related technologies developer
The most common Stable Diffusion generation tool
An extension for WebUI, like a plugin. Can be added from WebUI's extensions tab.
AUTOMATIC1111, author of webui
The system that controls how the machine learns images and some unknown decision/association properties
The system that translates your prompt's words or tokens into data the AI understands
A text encoder; typically the one we will be training. Stable Diffusion v2 models use OpenCLIP instead
Also known as "rank", it's the total capacity of the model, usually reflected as a bigger file
Actually, this is less of an AI and more "machine learning", but it's easier to call it "AI" informally
A different type of models training that results in larger files (2-4GB)
The type of training covered in this guide. The formal spelling is "LoRA", from "Low-Rank Adaptation"
Also known as "textual inversion". It's an older style that only trains the text encoder
Similar to an embed, but acting on the Unet instead
Training a character, object, vehicle, background...into a model
Training a model to reproduce a specific aesthetic
Training a model to reproduce something like a pose or composition
A combination of your training images and tags
Household term for extracting a LORA from a bigger model
A model tries to reproduce the training set too aggressively, usually a result of a burned Unet
An effect where the generated images have very saturated colors, usually a result of a high CFG scale
A smaller AI that gives you the tags of the things it finds in an image
The central idea of generating images with diffusion models relies on the fact that we have powerful computer vision models. Given a large enough dataset, these models can learn complex operations.
Let’s make some noise
Diffusion models approach image generation by framing the problem as follows:
Say we have an image. We generate some noise and add it to the image.
This can now be considered a training example. We can use this same formula to create lots of training examples to train the central component of our image generation model.
A few discrete noise amounts (from amount 0, no noise at all, to amount 4, pure noise) are enough to illustrate the idea, but we can control exactly how much noise to add, and so we can spread it over tens of steps, creating tens of training examples per image for all the images in a training dataset.
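To make this concrete, here is a minimal sketch of the noising (forward) process in PyTorch, assuming the linear beta schedule from the DDPM paper; the function name make_training_example and the image sizes are our own, for illustration only:

import torch

# Linear noise schedule over T steps (constants from the DDPM paper).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # fraction of original signal left at step t

def make_training_example(image, t):
    # Corrupt a clean image to noise level t. Returns the noisy image,
    # the exact noise that was added, and the step number.
    noise = torch.randn_like(image)
    noisy_image = alpha_bars[t].sqrt() * image + (1 - alpha_bars[t]).sqrt() * noise
    return noisy_image, noise, t

# One clean image yields many training examples, one per noise level:
image = torch.randn(3, 64, 64)  # stand-in for a real training image
examples = [make_training_example(image, t) for t in range(0, T, 100)]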
With this dataset, we can train a noise predictor that, when run in a certain configuration, actually creates images. A training step should look familiar if you’ve had ML exposure:
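Here is a rough sketch of such a step, reusing the schedule defined above; the Conv2d model is a toy stand-in for the real Unet, which is far larger and also receives the step number as input:

import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # toy noise predictor
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(clean_images):
    # 1. Pick a random noise level for each image in the batch.
    t = torch.randint(0, T, (clean_images.shape[0],))
    # 2. Corrupt the images with that much noise (the forward process above).
    noise = torch.randn_like(clean_images)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    noisy_images = a.sqrt() * clean_images + (1 - a).sqrt() * noise
    # 3. Predict the noise and penalize the prediction error.
    loss = F.mse_loss(model(noisy_images), noise)
    # 4. Update the model weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

batch = torch.randn(8, 3, 64, 64)  # stand-in for a batch of real images
loss = training_step(batch)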
Now let’s see how this can generate images.
Painting images by removing noise
The trained noise predictor takes a noisy image and the number of the denoising step, and predicts a slice of noise.
The noise is predicted such that if we subtract it from the image, we get an image that’s closer to the images the model was trained on (not the exact images themselves, but their distribution: the world of pixel arrangements where the sky is usually blue and above the ground, people have two eyes, and cats look a certain way, with pointy ears and a clearly unimpressed expression).
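Here is what that loop can look like, again as a sketch under the assumptions above. It uses the simplified DDPM update rule; production systems use faster samplers, but the shape of the loop is the same:

@torch.no_grad()
def generate(model, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)  # start from pure noise
    for t in reversed(range(T)):
        predicted_noise = model(x)  # a real Unet is also told the step number t
        # Remove the predicted slice of noise (the DDPM update rule):
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * predicted_noise) / alphas[t].sqrt()
        if t > 0:
            # Re-inject a small amount of fresh noise, as DDPM sampling prescribes.
            x = x + betas[t].sqrt() * torch.randn(shape)
    return x  # an image drawn from the learned distribution

generated = generate(model)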
If the training dataset was of aesthetically pleasing images (e.g., LAION Aesthetics, which Stable Diffusion was trained on), then the resulting image would tend to be aesthetically pleasing. If we train it on images of logos, we end up with a logo-generating model.
This concludes the description of image generation by diffusion models, mostly as described in the paper Denoising Diffusion Probabilistic Models. Now that you have this intuition of diffusion, you know the main components of not only Stable Diffusion but also DALL-E 2 and Google’s Imagen.
Illustration: How Stable Diffusion Combines Encoding, Decoding, and Diffusion
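If you want to run the full pipeline yourself, a minimal example using the Hugging Face diffusers library looks like this (assuming the library is installed and you have access to the model weights; the model id and prompt below are just examples):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # example model id; other checkpoints work too
    torch_dtype=torch.float16,
).to("cuda")

# Under the hood this runs the three components from the illustration:
# a CLIP text encoder, a Unet noise predictor looping over latents,
# and a decoder that turns the final latents into pixels.
image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")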
For further in-depth reading on this topic, including how Stable Diffusion can alter existing images (image + text inputs) in addition to generating them from text alone, see Jay Alammar's The Illustrated Stable Diffusion.