In 2015, a research paper from Stanford University and UC Berkeley introduced diffusion models, a concept originating in statistical physics, to the field of machine learning. According to the paper summary, “the essential idea is to systematically and slowly destroy the structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data.” That’s the basic idea used by the latest diffusion models, like DALL-E 2 or Stable Diffusion. However, back in 2015 the quality of the generated images was still quite poor, leaving huge room for improvement.
Five years later, in 2020, a research team from UC Berkeley introduced a seminal research paper with a few groundbreaking changes that led to a huge jump in the quality of the generated images. We’ll start our overview with this paper and then we’ll see what other influential research papers have revolutionized the field of image generation. If you’d like to skip around, here are the research papers we featured:
- Denoising Diffusion Probabilistic Models by UC Berkeley
- Diffusion Models Beat GANs on Image Synthesis by OpenAI
- Stable Diffusion by Computer Vision and Learning Group (LMU)
- DALL-E 2 by OpenAI
- Imagen by Google
- ControlNet by Stanford
If this in-depth educational content is useful for you, you can subscribe to our AI research mailing list to be alerted when we release new material.
The Most Influential Research Papers on Image Generation with Diffusion Models
1. Denoising Diffusion Probabilistic Models by UC Berkeley
Summary
UC Berkeley researchers introduced Denoising Diffusion Probabilistic Models (DDPMs), a new class of generative models that learn to convert random noise into realistic images. DDPMs define a forward diffusion process that gradually transforms images into noise and learn a reverse denoising process, trained with an objective closely connected to denoising score matching, that restores images from noise. By training denoising networks to minimize this loss, DDPMs can generate high-quality samples from random noise.
What is the goal?
- To demonstrate that diffusion probabilistic models are capable of generating high-quality images.
How is the problem approached?
- The authors use a diffusion probabilistic model, which is a parameterized Markov chain trained using variational inference to produce samples matching the data after finite time.
- Transitions of this chain are trained to invert a diffusion process that gradually adds noise to the data, moving in the opposite direction of sampling until the signal is destroyed.
- If the diffusion process involves small quantities of Gaussian noise, the transitions of the sampling chain can be set to conditional Gaussians, making the neural network parameterization particularly straightforward.
- The research shows that the best results are obtained by training on a weighted variational bound designed according to a novel connection between diffusion probabilistic models and denoising score matching with Langevin dynamics.
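To make the training recipe more concrete, here is a minimal PyTorch sketch of the simplified objective described above: sample a random timestep, add Gaussian noise to the image using the closed-form forward process, and train a network to predict that noise with a mean-squared error loss. The `noise_predictor` below is a hypothetical stand-in for the UNet used in the paper, and the schedule values follow the linear schedule reported there; this is an illustrative sketch, not the official implementation.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative product of alphas

def ddpm_loss(noise_predictor, x0):
    """Simplified DDPM objective: predict the noise added at a random timestep.

    x0 is a batch of images with shape (batch, channels, height, width).
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)      # random timestep per sample
    noise = torch.randn_like(x0)                          # epsilon ~ N(0, I)
    a_bar = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    # Closed-form forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * epsilon
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    # The network learns to recover the noise from the noisy image and the timestep
    return F.mse_loss(noise_predictor(x_t, t), noise)
```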
What are the results?
- The authors demonstrated that diffusion models can be a suitable tool for generating high-quality image samples.
- Also, the model introduced in the research paper can interpolate images in the latent space, thus eliminating any artifacts that may be introduced by interpolating images in pixel space. The reconstructed images are of high quality.
- The authors also show that latent variables encode meaningful high-level attributes about samples such as pose and eyewear.
Where to learn more about this research?
- Research paper: Denoising Diffusion Probabilistic Models
- Also, get a more detailed summary of the paper with multiple generated samples on the paper’s website.
Where can you get implementation code?
- The official TensorFlow implementation of Denoising Diffusion Probabilistic Models is available on GitHub.
2. Diffusion Models Beat GANs on Image Synthesis by OpenAI
Summary
With this research, the OpenAI team challenged the GAN dominance in image generation by demonstrating that diffusion models can generate superior image quality. By leveraging a denoising score matching framework and a forward diffusion process, DDPMs learn to generate high-quality image samples from random noise. The study showcases the potential of this new class of models in various image synthesis applications, highlighting their ability to capture more diversity, train more stably, and suffer fewer mode collapse issues compared to GANs.
What is the goal?
- To demonstrate that diffusion models can outperform Generative Adversarial Networks (GANs) in image synthesis, since even though GANs achieve state-of-the-art image quality, these models:
- capture less diversity;
- are often difficult to train, as they can easily collapse without carefully selected hyperparameters and regularizers.
How is the problem approached?
- The OpenAI researchers suggested bringing the benefits of GANs to diffusion models by:
- improving the model architecture;
- devising a scheme for trading off diversity for fidelity.
- Specifically, they were able to substantially boost the FID score by introducing, among others, the following architectural changes:
- Increasing depth versus width, holding model size relatively constant.
- Increasing the number of attention heads.
- Using attention at 32×32, 16×16, and 8×8 resolutions rather than only at 16×16.
- Using the BigGAN residual block for upsampling and downsampling the activations.
- Also, they have developed a technique for utilizing classifier gradients to guide a diffusion model during sampling.
- They discovered that one specific hyperparameter, the scale of the classifier gradients, can be tuned to trade off diversity for fidelity.
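The classifier-guidance idea can be summarized in a few lines of PyTorch. The sketch below assumes that `model_mean` and `model_variance` come from the diffusion model’s reverse step and that `classifier` is a classifier trained on noisy images; the function and variable names are illustrative rather than taken from the official codebase.

```python
import torch

def classifier_guided_mean(classifier, x_t, t, y, model_mean, model_variance, guidance_scale):
    """Shift the reverse-process mean toward images the classifier assigns to class y.

    A larger guidance_scale trades diversity for fidelity.
    """
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]   # gradient of log p(y | x_t)
    # Guided mean: mu + s * Sigma * grad
    return model_mean + guidance_scale * model_variance * grad
```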
What are the results?
- The results demonstrated that:
- Diffusion models can obtain better sample quality than state-of-the-art GANs.
- On class-conditional tasks, the scale of the classifier gradients can be adjusted to trade off diversity for fidelity.
- Integrating guidance with upsampling enables further enhancement of sample quality for conditional image synthesis at high resolutions.
Where to learn more about this research?
- Research paper: Diffusion Models Beat GANs on Image Synthesis.
- For a deep dive into the Denoising Diffusion Probabilistic Model (DDPM) introduced in the paper, check out the following YouTube video: DDPM – Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained)
Where can you get implementation code?
- The official implementation of this paper is available on GitHub.
3. Stable Diffusion by Computer Vision and Learning Group (LMU)
Summary
The developers of Stable Diffusion models decided to address the problem of high computational cost and expensive inference in diffusion models (DMs), already known for their state-of-the-art synthesis results on image data. To tackle this issue, the researchers applied DMs in the latent space of powerful pretrained autoencoders, which allowed them to achieve a near-optimal balance between complexity reduction and detail preservation. They also introduced cross-attention layers to make the DMs more flexible and capable of handling general conditioning inputs like text or bounding boxes. As a result, their latent diffusion models (LDMs) achieved new state-of-the-art scores for image inpainting and class-conditional image synthesis, as well as competitive performance in tasks such as text-to-image synthesis, unconditional image generation, and super-resolution. Furthermore, LDMs significantly reduced computational requirements compared to pixel-based DMs.
What is the goal?
- To develop a method that enables diffusion models (DMs) to be trained with limited computational resources while retaining their quality and flexibility.
How is the problem approached?
- The research group suggested separating training into two distinct phases:
- Training an autoencoder to provide a lower-dimensional and perceptually equivalent representational space.
- Training diffusion models in the learned latent space, resulting in Latent Diffusion Models (LDMs).
- As a result, a universal autoencoding stage requires only one-time training, enabling efficient exploration of various image-to-image and text-to-image tasks.
- For the latter, the researchers designed an architecture that connects transformers to the DM’s UNet backbone to enable arbitrary token-based conditioning mechanisms.
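Since the pretrained models were released publicly, they are easy to try out. The snippet below is a minimal text-to-image example using the Hugging Face diffusers library (a third-party wrapper, not part of the original paper), assuming a CUDA GPU and the publicly available runwayml/stable-diffusion-v1-5 checkpoint.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pretrained latent diffusion pipeline (autoencoder + UNet + text encoder)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# The text prompt is the conditioning input routed through the cross-attention layers
image = pipe("a small house near a lake, oil painting").images[0]
image.save("house.png")
```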
What are the results?
- Latent diffusion models achieve competitive performance on multiple tasks and datasets with significantly lower computational costs.
- For densely conditioned tasks such as super-resolution, inpainting, and semantic synthesis, LDMs can render large, consistent images at a resolution of 1024×1024 px.
- The researchers also introduced a general-purpose conditioning mechanism based on cross-attention that enables multi-modal training of class-conditional, text-to-image, and layout-to-image models.
- Finally, they released the pretrained latent diffusion and autoencoding models to the general public to enable their reuse for various tasks, even beyond the training of diffusion models.
Where to learn more about this research?
- Research paper: High-Resolution Image Synthesis with Latent Diffusion Models
- Research webpage
Where can you get implementation code?
- The official implementation of this research is available on GitHub.
4. DALL-E 2 by OpenAI
Summary
OpenAI’s DALL-E 2 builds on the original DALL-E’s capabilities for text-guided image synthesis by addressing certain limitations and improving composability. The researchers trained DALL-E 2 on hundreds of millions of image-text pairs to develop a generative model capable of synthesizing intricate and diverse images based on complex textual prompts.
What is the goal?
- To build a model that can:
- synthesize realistic images from a text description, while capturing both semantics and styles;
- enable language-guided image manipulations.
How is the problem approached?
- DALL·E 2 is a two-part model made of a prior and a decoder model:
- First, the prior model takes a text description and creates an image embedding from it. It is a computer analogy of the mental imagery that appears in human minds when we imagine a certain object (e.g., a small house near the lake).
- Next, the decoder model takes this image embedding and generates images. Similar to people who can draw different pictures with different details and in different styles from the same mental imagery, the decoder model can generate multiple images from the same image embedding by changing the details not specified in the text description.
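Since OpenAI has not released the DALL-E 2 code, the sketch below is purely illustrative: it shows how the prior and decoder stages fit together, with `text_encoder`, `prior`, and `decoder` as hypothetical placeholder modules rather than the actual CLIP text encoder, diffusion prior, and diffusion decoder described in the paper.

```python
import torch.nn as nn

class TwoStageTextToImage(nn.Module):
    """Illustrative two-stage pipeline in the spirit of DALL-E 2 (unCLIP)."""

    def __init__(self, text_encoder, prior, decoder):
        super().__init__()
        self.text_encoder = text_encoder   # hypothetical: text -> text embedding
        self.prior = prior                 # hypothetical: text embedding -> image embedding
        self.decoder = decoder             # hypothetical: image embedding -> image

    def forward(self, text_tokens, num_variations=4):
        text_emb = self.text_encoder(text_tokens)
        image_emb = self.prior(text_emb)
        # The decoder can produce several distinct images from one image embedding,
        # varying the details that the text description did not specify.
        return [self.decoder(image_emb) for _ in range(num_variations)]
```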
What are the results?
- The numerous experiments demonstrate that with DALL-E 2, you can:
- create original and realistic images from text descriptions, where you can specify not only the attributes of an image but also its style (e.g., “in a photorealistic style”, “as a pencil drawing”);
- make realistic edits to existing images following the text instructions (note that shadows, reflections, and textures are also taken into account);
- take an image and create different variations inspired by this original image.
- The OpenAI team has also implemented a few safety mitigation measures to address common risks and limitations of diffusion models, for example, limiting DALL·E 2’s ability to generate violent, hateful, or adult images.
Where to learn more about this research?
- Research paper: Hierarchical Text-Conditional Image Generation with CLIP Latents
- Blog post: DALL-E 2 by the OpenAI team
Where to get implementation code?
- As of now, the implementation code for DALL-E 2 has not been released. However, you can refer to the research paper for details on the methodology and techniques employed.
5. Imagen by Google
Summary
Imagen is a text-to-image diffusion model introduced by the Google Research team. The model demonstrates a high degree of photorealism and deep language understanding. Building upon the strengths of large transformer language models (e.g., T5) and diffusion models, Imagen shows that increasing the size of the language model improves sample fidelity and image-text alignment more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset without training on it, and human raters find its samples to be on par with COCO data in image-text alignment. The researchers also introduced DrawBench, a benchmark for text-to-image models, which shows that human raters prefer Imagen over other models, including VQ-GAN+CLIP, Latent Diffusion Models, GLIDE, and DALL-E 2, in terms of sample quality and image-text alignment.
What is the goal?
- Similar to DALL-E 2, the Imagen model generates realistic images from text descriptions. The focus of this model is on unprecedented photorealism of the output images.
How is the problem approached?
- To generate photorealistic images from the input text, the algorithm goes through several steps:
- First, a large T5 language model is used to encode the input text into embeddings. The Google team claims that the size of the language model has a significant impact on both sample fidelity and image-text alignment.
- Then, a conditional diffusion model maps the text embedding into a 64×64 image.
- Finally, text-conditional super-resolution diffusion models are used to upsample the image (64×64→256×256 and 256×256→1024×1024) and get a photorealistic output image.
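Google has not released Imagen’s code, so the sketch below is only a schematic of the cascade described above; every model argument is a hypothetical placeholder.

```python
def generate_imagen_style(text, t5_encoder, base_model, sr_256, sr_1024):
    """Illustrative three-stage, Imagen-style cascade (placeholder models)."""
    text_emb = t5_encoder(text)            # frozen T5 encoder -> text embeddings
    img_64 = base_model(text_emb)          # text-conditional diffusion -> 64x64 image
    img_256 = sr_256(img_64, text_emb)     # text-conditional super-resolution -> 256x256
    img_1024 = sr_1024(img_256, text_emb)  # text-conditional super-resolution -> 1024x1024
    return img_1024
```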
What are the results?
- Imagen produces 1024×1024 samples with unprecedented photorealism and image-text alignment.
- The authors claim that human raters prefer Imagen over other models (including DALL-E 2) in side-by-side comparisons, both in terms of image quality and alignment with text.
- Similar to the OpenAI team, the Google Research team decided not to release code or a public demo due to very similar concerns (e.g., generation of harmful content, reinforcement of social stereotypes).
Where to learn more about this research?
- Research paper: Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- Blog post: Imagen: Text-to-Image Diffusion Model by the Google Research team.
Where can you get implementation code?
- The unofficial PyTorch implementation of Imagen is available on GitHub.
6. ControlNet by Stanford
Summary
ControlNet is a neural network structure designed by the Stanford University research team to control pretrained large diffusion models and support additional input conditions. ControlNet learns task-specific conditions in an end-to-end manner and demonstrates robust learning even with small training datasets. The training process is as fast as fine-tuning a diffusion model and can be performed on personal devices or scaled to handle large amounts of data using powerful computation clusters. By augmenting large diffusion models like Stable Diffusion with ControlNets, the researchers enable conditional inputs such as edge maps, segmentation maps, and keypoints, thereby enriching methods to control large diffusion models and facilitating related applications.
What is the goal?
- To build a framework that would allow more control over pretrained large diffusion models by supporting additional input conditions.
How is the problem approached?
- The researchers introduced ControlNet, an end-to-end neural network architecture that controls large image diffusion models to learn task-specific input conditions.
- First, ControlNet clones the weights of a large diffusion model into a “trainable copy” and a “locked copy”:
- The locked copy preserves the network capability learned from billions of images.
- The trainable copy is trained on task-specific datasets to learn conditional control.
- Next, the trainable and locked neural network blocks are connected with a unique type of convolution layer called “zero convolution”:
- Convolution weights progressively grow from zeros to optimized parameters in a learned manner.
- As a result:
- The preserved production-ready weights make training robust on datasets of different scales.
- Zero convolution does not add new noise to deep features, making the training process as fast as fine-tuning a diffusion model.
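The zero-convolution trick is simple to express in PyTorch. The sketch below shows a 1×1 convolution initialized to zeros and a deliberately simplified wiring of a locked block with its trainable copy; the module names and the exact way the condition is injected are illustrative, not a reproduction of the official implementation.

```python
import torch.nn as nn

def zero_convolution(in_channels, out_channels):
    """1x1 convolution whose weights and bias start at zero."""
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Simplified wiring: frozen (locked) block plus a zero-convolved trainable branch."""

    def __init__(self, locked_block, trainable_block, channels):
        super().__init__()
        self.locked = locked_block.requires_grad_(False)  # locked copy: weights frozen
        self.trainable = trainable_block                  # trainable copy: learns the condition
        self.zero_conv = zero_convolution(channels, channels)

    def forward(self, x, condition):
        # The condition is assumed to be already encoded to the same shape as x.
        # At initialization the zero convolution outputs zeros, so the block behaves
        # exactly like the original locked diffusion model and adds no new noise.
        return self.locked(x) + self.zero_conv(self.trainable(x + condition))
```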
What are the results?
- ControlNet is a game-changer in AI image generation as it allows much more control over the output images through multiple possible input conditions.
- Large diffusion models can be augmented with ControlNet to enable conditional inputs like edge maps, HED maps, hand-drawn sketches, human poses, segmentation maps, depth maps, keypoints, etc.
Where to learn more about this research?
- Research paper: Adding Conditional Control to Text-to-Image Diffusion Models
- Blog post by ControlNet developers: Ablation Study: Why ControlNets use deep encoder? What if it was lighter? Or even an MLP?
Where can you get implementation code?
- The official implementation of this paper is available on GitHub.
Real-world Applications of Diffusion Models for Image Generation
Diffusion models for image generation have made significant strides in recent years, opening up a wide array of real-world applications.
- Text-to-image diffusion models could transform graphic design:
- Generating an image using AI is much cheaper than hiring a human graphic designer.
- However, graphic designers may evolve into a crucial interface between their clients and this technology. They can learn the nuances of AI models and also decide on the creative filters to be applied to generate images (e.g., Vincent van Gogh style or Andy Warhol style).
- These models can also disrupt the art industry:
- Artists might feel threatened by systems such as DALL-E 2, but in fact, these models have many limitations that prevent them from fully substituting for artists. Most importantly, modern AI models do not understand the underlying realities and relationships between different objects.
- So, more likely, this technology will assist certain artists who will guide the AI models to some interesting and creative outputs, becoming an interface between technology and customers.
- Similarly, image generation powered by diffusion models is likely to revolutionize a few more industries, including photography, marketing, advertising, and others.
However, in its current state of development, this technology has a number of significant risks and limitations.
Risks and Limitations
To begin with, AI image generation tools are in a grey zone when it comes to the legal aspects of training these AI models on copyrighted images, generating new images in the style of other artists, and defining ownership over the output images. Clear regulations are required to protect original artists, whose works were used to train AI generation models, but also to recognize the contribution of AI creators, who master their prompting skills and generate remarkable artwork using AI.
Then, it’s important to remember that there are numerous well-known malicious uses of image generation models:
- AI image generators can be used to produce harmful content, including images related to violence, harassment, illegal activity, and hate.
- They can also be employed to produce fake images and videos of high-profile figures.
- Generative models also reflect the biases in the datasets on which they are trained. If samples from generative models trained on these datasets proliferate throughout the internet, then these biases will only be reinforced further.
AI image generators incorporate various filters to prevent generating harmful content, but these filters can be circumvented. It’s especially easy to do when the code is open-sourced, like in the case of Stable Diffusion.
In addition to numerous risks, AI text-to-image generators have their limitations:
- First of all, there is a lack of control: you may be unable to recreate the image you have in mind, no matter how detailed your prompt is.
- Image generators also have challenges with creating complex compositions, dynamic poses, and large crowds.
- As of now, they are unable to accurately depict letters, words, and symbols in images.
- Finally, AI image generation tools will not let you push the stylization of your images very far if that requires a significant deviation from correct structure and anatomy.
Conclusion
Despite the considerable limitations of current AI image generators, there are reasons for optimism regarding the future of this technology. Over the past year, the field has witnessed tremendous advancements, and it is reasonable to expect that some of the technology’s shortcomings will be addressed in the near future.
ControlNet is an example of a recent development that gives AI creators more control over the output images, while Stable Diffusion’s team is working on accurately generating words within images. Midjourney is implementing a new AI moderation system to block harmful content, but also avoid wrongly banning innocent prompts.
The aforementioned instances serve as a mere glimpse into the efforts being made in the field to enhance the abilities and ethical considerations of AI-generated images. As AI continues to evolve, we can anticipate increasingly sophisticated models that can better understand the real world and produce more accurate and diverse images.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.