How Midjourney and Other Diffusion Models Create Images from Random Noise

Diffusion models such as Midjourney and DALL-E 2 can produce stunning images, but their training always begins with complete noise.

By Rostyslav Demush · Apr. 22, 23 · Analysis


Colossal hype ensues whenever there's progress or even the slightest shift in the world of machine learning (ML). When it comes to AI, the craze always ends up growing out of proportion and insane claims invariably follow. You've probably noticed this recently as more people are beginning to use Midjourney, ChatGPT, Copilot, and so on.

The truth about these systems, however, is always more prosaic. Machine learning is mathematical, and the real implications of ML models are not nearly as profound as some bloggers would have you believe. They can be beneficial and, in some cases, transform large portions of a workflow, but only if the user, whether an organization or an individual, has a sufficient understanding of their inner workings, limitations, capabilities, and potential.

This article hopes to shed some light on how diffusion models such as the highly popular Midjourney, DALL-E 2, and Stable Diffusion work, and in particular on how they are trained. It includes mathematical expressions, but it also describes what each variable represents in accessible terms.

Forward Diffusion

All diffusion models, including Midjourney (although there is no published paper describing it yet), are neural networks that create images from randomness. Training starts with the opposite task: gradually turning pictures into noise. This forward process forms a Markov chain* of timesteps, in which the image goes through a series of transformations and evolves from its original, clean state at t=0 to pure noise at t=T, the final step.

*A Markov chain is a sequence of variables in which the state of one variable depends only on the state of the previous variable.

The number of timesteps, which can range from a few hundred to a thousand or more, and the noise level to be applied at each step must be predetermined; this is known as the noising schedule. Mathematically, this noising, or forward diffusion, process is written as:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) x_{t-1}, β_t I)

q(x_0) is our real data distribution, and q(x_t | x_{t-1}) is the forward diffusion process, in which x_t is always conditioned on x_{t-1}. The symbol N stands for a Gaussian, or normal, distribution, which is defined by its mean μ and variance σ². Here the mean is √(1 - β_t) x_{t-1} and the variance is β_t I, where β_t comes from the predetermined variance schedule. The noise added at each step is sampled from a standard normal distribution, ε ∼ N(0, I).

In simpler terms, we have a normal distribution over the current step whose mean is √(1 - β_t) times the image from the previous step, x_{t-1}. In addition to this rescaling, we also add a small amount of noise, with variance β_t, to the image at each iteration. Think of β_t as a tiny positive scalar, for example 0.001; it is kept small on purpose.
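To make the forward step concrete, here is a minimal PyTorch sketch of q(x_t | x_{t-1}), assuming a hypothetical linear noising schedule of 1,000 steps; the schedule values and image size are placeholders rather than what any particular model uses:

```python
import torch

# Hypothetical linear noising schedule: T timesteps with small beta values.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # beta_t for steps 0..T-1

def forward_step(x_prev, t):
    """One step of q(x_t | x_{t-1}): rescale the previous image by
    sqrt(1 - beta_t) and add Gaussian noise with variance beta_t."""
    beta_t = betas[t]
    noise = torch.randn_like(x_prev)             # epsilon ~ N(0, I)
    return torch.sqrt(1.0 - beta_t) * x_prev + torch.sqrt(beta_t) * noise

# Example: diffuse a dummy 3x64x64 "image" through the whole chain.
x = torch.rand(3, 64, 64)                        # stands in for a real image x_0
for t in range(T):
    x = forward_step(x, t)                       # after T steps, x is essentially pure noise
```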

So, this is what's done at each timestep. But we can also define a joint distribution over all the samples generated in the sequence x_1, x_2, x_3, ..., x_T. It looks like this:

q(x_{1:T} | x_0) = ∏_{t=1}^{T} q(x_t | x_{t-1})

As you can see, this joint distribution is the product ∏ of the conditional distributions q(x_t | x_{t-1}) at timesteps 1 through T, given the initial image x_0.

OK, But Can We Skip Links in the Chain to Generate Any x_t Without Going Through All the Preceding Steps?

Yes, we can.

This is possible because we use a simple Gaussian kernel to diffuse the data. To do this, we compute the scalar value α_t = 1 - β_t and define ᾱ_t as the product of the α values from step 1 to step t. The forward diffusion variances β_t are essential to the process. They can be learned by reparameterization or held constant as hyperparameters, but they are always designed so that ᾱ_t approaches 0 at the final step T. This ensures that the fully diffused data follows a standard normal distribution, which is crucial for the reverse, generative process.

After we have the kernel, we can just sample any x_t, because:

x_t = √ᾱ_t x_0 + √(1 - ᾱ_t) ε, where ε (the noise) is drawn from a normal distribution with mean 0 and identity covariance matrix I.

In lay terms, if we need x_t, a random step in our Markov chain, we can generate it directly as long as x_0, ᾱ_t, and the noise term ε are available.
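Here is a sketch of that shortcut, under the same assumed schedule as the previous snippet; alpha_bars holds the cumulative products ᾱ_t:

```python
import torch

# Same hypothetical linear schedule as before.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                             # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)        # alpha_bar_t = alpha_1 * ... * alpha_t

def sample_xt(x0, t):
    """Jump straight to x_t in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon."""
    eps = torch.randn_like(x0)                   # epsilon ~ N(0, I)
    ab = alpha_bars[t]
    return torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps, eps

x0 = torch.rand(3, 64, 64)                       # stands in for a real training image
x500, eps = sample_xt(x0, t=500)                 # any timestep, no loop over the chain required
```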

Generative Process

Let's now delve into the reverse process, where the model generates new samples. The first thing to understand is that we cannot compute the denoising distribution q(x_{t-1} | x_t) directly, as that would require knowing the distribution of all the images in the dataset. However, we can use Bayes' rule to show that this distribution is proportional to the product of the marginal distribution q(x_{t-1}) and the diffusion kernel at step t, q(x_t | x_{t-1}):

q(x_{t-1} | x_t) ∝ q(x_{t-1}) q(x_t | x_{t-1})

But the product would still be intractable, so we need to approximate the conditional distribution instead. Luckily, we can do so with a normal distribution, because the noise injections β_t were small during the forward process.

We can write our approximation of this conditional distribution as p_θ(x_{t-1} | x_t), where θ denotes the model's parameters, which are iteratively optimized through gradient descent.

Since the end of our Markov chain is a normal distribution, we can assume that the reverse process is also Gaussian. It must therefore be parameterized by a mean μ_θ and a variance Σ_θ, which our neural network has to compute.

This is the parameterization of the reverse process: p_θ(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), Σ_θ(x_t, t))

In the original paper, Denoising Diffusion Probabilistic Models, the researchers found that fixing the variance, Σ_θ(x_t, t) = σ_t² I, was an optimal choice in terms of sample quality. In particular, they found that fixing σ_t² to β̃_t, the variance of the forward-process posterior, gives roughly the same results as fixing it to β_t: as more diffusion steps are added, β̃_t and β_t stay close to each other, so it is the mean, not the variance, that actually determines the distribution.

Note: in the paper "Improved Denoising Diffusion Probabilistic Models," which came out a bit later, the researchers did parameterize the variance as well. It helped improve the log-likelihood but did not noticeably improve sample quality.
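As a rough illustration of this parameterization, here is a sketch of one reverse step with the variance fixed to σ_t² = β_t, as described above. The MeanPredictor below is only a toy stand-in for the network that outputs μ_θ(x_t, t) (real systems use a timestep-conditioned U-Net), so this shows the mechanics, not a working generator:

```python
import torch

# Same hypothetical schedule; the fixed reverse variance is sigma_t^2 = beta_t.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)

class MeanPredictor(torch.nn.Module):
    """Toy stand-in for the network that predicts mu_theta(x_t, t)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, x_t, t):
        return self.net(x_t)

model = MeanPredictor()

def reverse_step(x_t, t):
    """One step of p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I)."""
    mu = model(x_t, t)
    sigma = torch.sqrt(betas[t])
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)  # no noise on the last step
    return mu + sigma * z

# Generation starts from pure noise and walks the chain backwards.
x = torch.randn(1, 3, 64, 64)
for t in reversed(range(T)):
    x = reverse_step(x, t)
```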

So How Do We Determine the Objective Function to Derive the Mean?

Since q and p_θ together can be viewed as a variational autoencoder, a model that approximates the data distribution by encoding data into a latent space* and decoding it back into the input space, we can train the model with the evidence lower bound (ELBO) objective: maximizing the ELBO is equivalent to minimizing a variational upper bound on the negative log-likelihood of x_0. This bound decomposes into a sum of losses, one per timestep, and each term is the KL divergence** between two Gaussian distributions.

*Latent space is a compressed space in which the input features are represented by separate dimensions; it helps the models easily find patterns and similarities between data objects.

**KL divergence is a measurement of the distance between two distributions. It basically tells you how much information you would lose if you approximate the target distribution with a model distribution.

As you recall, we can sample x_t at any noise level conditioned on x_0. Because q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1 - ᾱ_t) I), we can scale x_0 and add noise to reach any x_t directly. Additionally, since the ᾱ_t are functions of the predetermined variances β_t, we can easily optimize randomly chosen terms of the loss function L during training.

Another important benefit of this property is that we can turn our network into a noise predictor instead of a mean predictor. Specifically, the mean can be reparameterized so that the model approximates the noise added at step t, ε_θ(x_t, t), in the KL divergence terms. It looks like this:

μ_θ(x_t, t) = (1/√α_t) (x_t - (β_t/√(1 - ᾱ_t)) ε_θ(x_t, t))
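In code, and under the same assumed schedule as the earlier sketches, this reparameterization is only a few lines; predicted_noise stands in for the output of a trained ε_θ network:

```python
import torch

# Same hypothetical schedule as in the earlier sketches.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def mean_from_noise(x_t, t, predicted_noise):
    """mu_theta(x_t, t) = (1 / sqrt(alpha_t)) *
    (x_t - (beta_t / sqrt(1 - alpha_bar_t)) * eps_theta(x_t, t))."""
    coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
    return (x_t - coef * predicted_noise) / torch.sqrt(alphas[t])

# predicted_noise would normally come from the trained network eps_theta(x_t, t).
x_t = torch.randn(1, 3, 64, 64)
predicted_noise = torch.randn_like(x_t)          # placeholder for eps_theta(x_t, t)
mu = mean_from_noise(x_t, t=500, predicted_noise=predicted_noise)
```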

Finally, we arrive at this equation for the objective loss function L_t at a random timestep, given that the noise is sampled from the normal distribution N(0, I):

L_t = ||ε - ε_θ(x_t, t)||² = ||ε - ε_θ(√ᾱ_t x_0 + √(1 - ᾱ_t) ε, t)||²

Here, x_0 is the uncorrupted image, ε is the pure noise sampled at step t, and ε_θ(x_t, t) is the predicted noise obtained by passing the noised sample x_t through our neural network, which is parameterized by θ.

The network is optimized with the mean squared error between the predicted and true noise. By minimizing the distance between them, we teach the model to make progressively more accurate approximations.

So, in summary, this is the sequence of training steps that gives us models such as Midjourney and Stable Diffusion, which can generate images from pure noise.


(Figure: the training procedure, from the original DDPM paper.)
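Putting the pieces together, here is a minimal, hypothetical sketch of that training recipe in PyTorch. The single convolution stands in for the real denoising network (in practice a timestep-conditioned U-Net, with text conditioning added in systems like Stable Diffusion), and the batch is random data, so it only illustrates the loop: draw a random timestep, noise the image in closed form, and regress the injected noise.

```python
import torch
import torch.nn.functional as F

class NoisePredictor(torch.nn.Module):
    """Toy stand-in for eps_theta(x_t, t); real models are far larger."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
    def forward(self, x_t, t):
        return self.net(x_t)

# Same hypothetical schedule as in the earlier sketches.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

model = NoisePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(x0):
    """One DDPM-style training step: random timestep, closed-form noising,
    then mean squared error between true and predicted noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)                   # the true injected noise
    x_t = torch.sqrt(ab) * x0 + torch.sqrt(1.0 - ab) * eps
    loss = F.mse_loss(model(x_t, t), eps)        # ||eps - eps_theta(x_t, t)||^2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

batch = torch.rand(8, 3, 64, 64)                 # dummy batch standing in for real images
print(train_step(batch))
```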

