Note: Quite frankly, there are already a zillion articles out there explaining the intuition behind GANs. While I will briefly touch upon it, the rest of the article will be an absolute deep dive into the GAN architecture and mainly coding — but with a very detailed explanation of the pseudocode (open-sourced as an example by PyTorch on GitHub).
Why do I need GANs?
To put it simply, GANs let us generate incredibly realistic data (based on some existing data). Be it human faces, songs, Simpsons characters, textual descriptions, essay summaries, movie posters — GANs got it all covered! At Podurama, we are currently using them for high-resolution thumbnail synthesis.
How does it even work?
To generate realistic images, GANs must know (or more specifically learn) the underlying distribution of data.
What does that even mean?
It means we must feed it samples (images) from a specific distribution (of all cats or all human faces or all digits) such that if it is asked to generate images of cats, it must somehow be aware that a cat has four legs, a tail, and some whiskers. Or, if it is asked to generate digits, it must know what each digit looks like. Or, if it is asked to generate human faces, it must know that a face must contain two eyes, two ears, and a nose.
But how does probability distribution fit in with images?
While it is certainly easy to visualize a 1-dimensional distribution, say a histogram of ‘height’, or even a 2-dimensional distribution using a contour plot of height and weight (if you are feeling smug), it is not so easy with image data.
On the left, a 1d probability distribution curve; on the right, a 2d normal distribution curve. [Source]
In the case of images, we are dealing with a high-dimensional probability distribution.
How “high” a dimension are we talking about here?
Generally speaking, we use 32×32 images and each of them is colored, meaning three channels to capture the RGB components. So, our probability distribution has 32 * 32 * 3 = 3072 ≈ 3k dimensions. That is, the probability distribution spans every pixel in every image. Finally, the distribution that emerges will determine whether an image looks realistic or not (more on this in about 45 seconds).
Can I still visualize it somehow?
To put this in some perspective (and coming back to the original topic of probability distributions for images), let’s take an example of handwritten digits represented by two features, x1 and x2. That is, we have used some sort of dimensionality reduction technique to bring the ~3k dimensions down to just 2, such that we can still represent all ten digits (0–9) along these two dimensions.
As you can see, the probability distribution, in this case, has many peaks, ten to be precise, one corresponding to each of the digits. These peaks are nothing but the modes (a mode in a distribution of data is just an area with a high concentration of observations), i.e. our distribution of digits is a multimodal distribution.
Different images of the digit 7 are represented by similar (x1, x2) pairs, where x1 usually tends to be on the higher side compared to x2. Similarly, for the digit 5, both the x1 and x2 dimensions have lower values compared to those of the digit 7.
Now, if we have done a great job at training our GAN (in other words, it has learned the probability distribution correctly), then none of the GAN’s output images will end up in the space between 5 and 7, i.e. in areas of very low density. If not, we can be quite certain that the digit produced would look like a love-child of digits 5 and 7 (in short, random noise) and thus not one of the ten digits that we care about!
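If you want to get a feel for this kind of picture yourself, here is a minimal sketch (not part of the original figure) that uses scikit-learn’s t-SNE as the dimensionality reduction technique to squash its small 64-dimensional digits dataset down to 2 dimensions; the library, dataset, and perplexity value are my own illustrative choices.

```python
# Rough way to "see" the multimodal distribution of digits in 2D.
# t-SNE, the digits dataset, and perplexity=30 are illustrative choices.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

digits = load_digits()  # 8x8 grayscale digits, flattened to 64 features
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(digits.data)

# Each digit class should form its own high-density cluster (a "mode")
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=digits.target, cmap="tab10", s=5)
plt.colorbar(label="digit")
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
```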
Overview of the GAN architecture
To ensure GANs are able to replicate a probability distribution nicely, their architecture is essentially composed of two neural networks competing against one another — a Discriminator and a Generator.
A Generator’s job is to create fake images that look real.
A Discriminator’s job is to correctly guess whether an image is fake (i.e. generated by the Generator) or real (i.e. coming directly from the input source).
Once the Generator becomes good enough at creating (fake) images that the Discriminator perceives as real (i.e. we have deceived the Discriminator), our job is done and the GAN is trained.
In terms of coding these two neural networks,
- Discriminators can be thought of as simple binary image classifiers, which take as input an image and spit out whether the image is real (output = 1) or fake (output = 0).
- Generators are somewhat more complex, in that they take as input some random numbers or noise (say a vector of size 100, where the choice of 100 is arbitrary) and perform computations on it in the hidden layers such that the final output is an image (or, more specifically, a vector of size h*w*c, where h is the image height, w is the image width, and c is the number of channels, i.e. c=3 for a colored RGB image).
Note: This image, although fake, must have the same dimensions as the real images, i.e. if the size of the real images in our source data is 32*32*3, then the output from the Generator should also be an image of size 32*32*3.
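To make those two interfaces concrete, here is a shape-only sketch in PyTorch. The Linear layers are toy stand-ins (the real networks use convolutional layers, covered in Part 2); only the input and output shapes matter here, and the sizes 100, 32, 32, 3 follow the text above.

```python
import torch
import torch.nn as nn

nz, h, w, c = 100, 32, 32, 3  # noise size and image dimensions from the text

# Toy stand-ins for the real architectures; only the shapes matter here.
G = nn.Sequential(nn.Linear(nz, h * w * c), nn.Tanh())    # noise -> image-sized vector
D = nn.Sequential(nn.Linear(h * w * c, 1), nn.Sigmoid())  # image-sized vector -> [0, 1]

z = torch.randn(16, nz)   # a batch of 16 noise vectors
fake = G(z)               # shape: (16, 3072), i.e. 16 fake "images"
score = D(fake)           # shape: (16, 1), each entry between 0 and 1
print(fake.shape, score.shape)
```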
Objective of the Discriminator
Generally speaking (and this will come in handy when we are coding our GANs from scratch in Part 2), the aim of the Discriminator is to be the best at telling fake images from real ones. As a result, when calculating the amount of error the Discriminator makes during the training phase, we must include two terms:
- Real error (or positive error): the amount of error made when discriminator is passed real images.
- Fake error (or negative error): the amount of error made when discriminator is passed fake images (created by generator).
The sum of the positive and negative error is what will be optimized during the training process.
Mathematically speaking, the discriminator’s objective is to:
max { log D(x) + log (1- D(G(z))) }
D: Discriminator
G: Generator
x: real image
z: noise vector
As you can see, the objective function has two parts to it, both of which need to be maximized in order to train the discriminator. Maximizing log (D(x)) takes care of the positive (or real) error whereas maximizing log (1 - D(G(z))) takes care of the negative (or fake) error. Let’s see how…
Why should we maximize log (D(x)) in the above equation?
As we mentioned earlier, a discriminator is essentially a binary classifier and thus, D(x) will generate a value between 0 and 1, establishing how real (or fake) it thinks the input image is.
Since x is a real image,
- in an ideal world (one where D is trained perfectly to recognize fakes from real), D(x) output should be ≈ 1. That means log (D(x)) will be roughly equal to 0.
- in a not-so-ideal world (where D is still learning), D(x) would output, say, 0.2, meaning it is only 20% confident that the image is real. That means log (D(x)) will be roughly -1.6.
In short, when passed real images, the discriminator’s objective is to maximize log (D(x)), increasing it from a meager -1.6 towards 0 (the maximum achievable value).
Why should we maximize log (1-D(G(z))) in the above equation?
Since z is a noise vector, passing it to the Generator G should output an image. In other words, G(z) will be an image; let’s call it fake_image:
- in an ideal world (one where D is trained perfectly to recognize fakes from real), passing fake_image to D will result in D(G(z)) ≈0. Consequently, log (1– *a-very-small-value*) will be roughly equal to 0.
- in a not-so-ideal world, passing fake_image to D will result in D(G(z)) ≈ 0.99, as D is not well trained and thinks the fake image is real. Consequently, log (1 – 0.99) = log (0.01) will be roughly -4.6.
In short, when passed fake images, the discriminator’s objective is to maximize log (1-D(G(z))), increasing it from a meager -4.6 towards 0 (the maximum achievable value).
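In code, maximizing log D(x) + log (1 - D(G(z))) is usually done by minimizing binary cross-entropy, which is -[y·log p + (1-y)·log(1-p)]: labels of 1 for real images recover the -log D(x) term, and labels of 0 for fakes recover the -log (1 - D(G(z))) term. A minimal sketch, with the scores made up for illustration (in practice they come from D(real_batch) and D(G(z))):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()  # binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)]

# Made-up discriminator scores; in practice: d_real = D(real_batch), d_fake = D(G(z))
d_real = torch.tensor([0.9, 0.2, 0.7])   # D's scores on real images
d_fake = torch.tensor([0.1, 0.8, 0.3])   # D's scores on generated images

real_loss = criterion(d_real, torch.ones_like(d_real))    # -mean(log D(x))
fake_loss = criterion(d_fake, torch.zeros_like(d_fake))   # -mean(log(1 - D(G(z))))

# Minimizing this sum is the same as maximizing log D(x) + log(1 - D(G(z)))
d_loss = real_loss + fake_loss
print(d_loss)
```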
Objective of the Generator
For a generator, the biggest challenge is to produce an image that is realistic enough to fool the Discriminator. In other words, calculating the amount of error that the Generator makes during the training phase can be easily figured out with the help of the Discriminator.
Speaking strictly from the Generator’s point of view, it would like the Discriminator to churn out output = 1 (or a very high number close to 1) when one of its fake images is given as input.
Mathematically speaking, this is precisely what a Generator’s objective is:
max { log D(G(z)) }
Now one might wonder: why does the Generator’s objective function have only one term to maximize whereas the Discriminator’s had two? That’s because a Discriminator must deal with both fake and real images as inputs, and so we must calculate the loss for each separately. A Generator, however, never has to deal with real images since it never sees them (remember: a Generator’s input is some random noise, not a real image), so there is no need for an additional loss term.
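The Generator’s max log D(G(z)) objective is implemented the same way: minimize binary cross-entropy between D’s scores on the fake images and a target of 1. A sketch under the same assumptions as the snippet above (the scores are again made up for illustration):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss()

d_fake = torch.tensor([0.1, 0.8, 0.3])               # D's scores on G's fake images
g_loss = criterion(d_fake, torch.ones_like(d_fake))  # -mean(log D(G(z)))
print(g_loss)  # small only when D is fooled, i.e. D(G(z)) is close to 1
```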
Overview of Discriminator Neural Network
Note: While I will be discussing how to build GANs (DCGANs to be more specific) from scratch in Part 2, right now we are going to look at which hidden layers are at play in the Discriminator and Generator networks.
As mentioned, the discriminator, D, is a binary classification network that takes an image as input and outputs a scalar probability that the input image is real (as opposed to fake).
Here, D takes a 32*32*3 input image, processes it through a series of Conv2d, BatchNorm2d, Dropout, and LeakyReLU layers, and outputs the final probability through a Sigmoid activation function.
P.S. Do not worry if you don’t understand what each layer does, that’s what we will be covering in Part 2!
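As a rough sketch of such a discriminator in PyTorch: the layer counts, channel sizes, dropout rate, and LeakyReLU slope below are illustrative guesses on my part, not necessarily those of the official example.

```python
import torch
import torch.nn as nn

# Sketch of a DCGAN-style discriminator for 32x32x3 inputs.
D = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),    # 32x32 -> 16x16
    nn.LeakyReLU(0.2, inplace=True),
    nn.Dropout(0.3),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 16x16 -> 8x8
    nn.BatchNorm2d(128),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Dropout(0.3),
    nn.Conv2d(128, 256, kernel_size=4, stride=2, padding=1), # 8x8 -> 4x4
    nn.BatchNorm2d(256),
    nn.LeakyReLU(0.2, inplace=True),
    nn.Conv2d(256, 1, kernel_size=4, stride=1, padding=0),   # 4x4 -> 1x1
    nn.Sigmoid(),                                            # probability the input is real
)

x = torch.randn(16, 3, 32, 32)   # a batch of 16 RGB images
print(D(x).view(-1).shape)       # torch.Size([16]), one probability per image
```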
Overview of Generator Neural Network
As mentioned, the Generator G is a neural network that tries to produce (hopefully) realistic-looking images. It takes as input a random noise vector z and tries to create an RGB image of the same size as the training images, i.e. 32*32*3 (see image below). To do so, it processes z through a series of strided ConvTranspose2d layers, each paired with a BatchNorm2d layer and a ReLU activation.
Note: The images used here for training are of size 32*32*3. In case you are working with a different size, the structure of both D and G must be updated accordingly.
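And a matching sketch of a DCGAN-style generator, again with illustrative channel sizes. The Tanh at the end, which squashes pixel values into [-1, 1], is a common DCGAN convention rather than something stated above.

```python
import torch
import torch.nn as nn

nz = 100  # length of the input noise vector

# Sketch of a DCGAN-style generator producing 32x32x3 images from noise.
G = nn.Sequential(
    nn.ConvTranspose2d(nz, 256, kernel_size=4, stride=1, padding=0),  # 1x1 -> 4x4
    nn.BatchNorm2d(256),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), # 4x4 -> 8x8
    nn.BatchNorm2d(128),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 8x8 -> 16x16
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),    # 16x16 -> 32x32
    nn.Tanh(),  # pixel values in [-1, 1]
)

z = torch.randn(16, nz, 1, 1)   # noise shaped as a 1x1 "image" with nz channels
print(G(z).shape)               # torch.Size([16, 3, 32, 32])
```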
Steps involved in training a GAN
Regardless of whichever framework you choose to code your GAN, the steps more or less remain the same.
for each epoch:
    for each batch b of input images:

        ##############################
        ## Update Discriminator - D ##
        ##############################

        # loss on real images
        clear gradients of D
        pred_labels_real = pass b through D to compute outputs
        true_labels_real = [1,1,1....1]
        calculate loss(pred_labels_real, true_labels_real)
        calculate gradients using this loss

        # loss on fake images
        generate batch of size b of fake images (b_fake) using G
        pred_labels_fake = pass b_fake through D
        true_labels_fake = [0,0,....0]
        calculate loss(pred_labels_fake, true_labels_fake)
        calculate gradients using this loss

        update weights of D

        ##############################
        #### Update Generator - G ####
        ##############################

        clear gradients of G
        pred_labels = pass b_fake through D
        true_labels = [1,1,....1]
        calculate loss(pred_labels, true_labels)
        calculate gradient using this loss
        update weights of G

        ################################################
        ## Optional: Plot a batch of Generator images ##
        ################################################
We begin with an outer loop stating how many epochs we want our code to run for. If we set epochs = 10, the model is going to train on all of the data 10 times. Next, instead of working with all the images in our training set at once, we are going to draw out small batches (say of size 64) in each iteration.
Refrain from using an exceptionally large value for batch size since we do not want the Discriminator getting too good too soon (as a result of having access to too much training data in initial iterations) and overpowering the Generator.
The training process is further split up into two parts — updating Discriminator and updating Generator (and an optional third part where you throw in some random noise into the Generator (say, every 50th iteration) and check the output images to see how good it is doing).
Updating Discriminator
For updating the Discriminator (or rather, updating the Discriminator’s weights in each iteration to minimize the loss), we pass a batch of real images to the Discriminator and generate the output. The output vector will contain values between 0 and 1. Next, we compare these predicted values to their true labels, i.e. 1 (by convention, real images are labeled 1 and fake images are labeled 0). Once the Discriminator’s loss over real images is calculated, we calculate the gradient, i.e. take the derivative of the loss function with respect to the weights in the model.
Next, we pass some random noise input to the Generator and produce a batch of fake images. These images are then passed on to the Discriminator, which generates predictions (values between 0 and 1) for these fakes. Next, we compare these predicted values to their true labels, i.e. 0, and compute the loss, i.e. how far the predicted labels are from the true labels. Once the loss over fake images has been calculated, the derivative of the loss function is used to calculate gradients, just like in the case of real images. Finally, the weights are updated based on the gradient (w = w - learning_rate * w.gradient) to minimize the overall Discriminator loss.
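Put together, a Discriminator update step might look like the sketch below in PyTorch. The names netD, netG, criterion (an nn.BCELoss instance), optimizerD, and the noise size nz are my own; this is the standard recipe rather than the official example verbatim.

```python
import torch

def discriminator_step(netD, netG, real_batch, criterion, optimizerD, nz=100):
    """One Discriminator update on a batch of real images (sketch)."""
    device = real_batch.device
    b = real_batch.size(0)

    netD.zero_grad()                                  # clear gradients of D

    # Loss on real images: targets are all 1s
    pred_real = netD(real_batch).view(-1)
    loss_real = criterion(pred_real, torch.ones(b, device=device))
    loss_real.backward()                              # gradients from the real half

    # Loss on fake images: targets are all 0s
    noise = torch.randn(b, nz, 1, 1, device=device)
    fake_batch = netG(noise)
    pred_fake = netD(fake_batch.detach()).view(-1)    # detach: don't update G here
    loss_fake = criterion(pred_fake, torch.zeros(b, device=device))
    loss_fake.backward()                              # gradients from the fake half

    optimizerD.step()                                 # update D's weights
    return loss_real + loss_fake, fake_batch          # fakes are reused in the G step
```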
Updating Generator
A very similar sequence of steps is used when updating the Generator. Here, we start by passing a batch of fake images (generated during Discriminator training) to the Discriminator. Now one might wonder: why did we pass the fake batch through the Discriminator a second time? Didn’t we just do that during Discriminator training? The reason is that the Discriminator D got updated before we started updating the Generator, and so a forward pass of the fake batch is essential.
Next, the loss is calculated using the output from the Discriminator and the true label of the images. One important thing to note is that, even though these images are fake, we set their true label as 1 during loss calculations.
But why, we thought 1 was reserved as a label for real images only!
To answer this, I am going to re-iterate a line from this article itself:
Speaking strictly from the Generator’s point of view, it would like the Discriminator to churn out output = 1 (or a very high number close to 1) when one of its fake images is given as input.
Because the Generator wants the Discriminator to think it is churning out real images, it sets the true labels to 1. This way, the loss function translates to minimizing how far D’s output for the fake images is from 1 (the label for real images).
Finally, the weights are updated based on the gradient to minimize the overall Generator loss.
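A matching sketch of the Generator step, under the same assumptions as the Discriminator sketch above. Note the target of 1 for the fake images, and that the fakes are not detached this time, so gradients can flow back into G.

```python
import torch

def generator_step(netD, netG, fake_batch, criterion, optimizerG):
    """One Generator update using the fake batch created above (sketch)."""
    device = fake_batch.device
    b = fake_batch.size(0)

    netG.zero_grad()                                  # clear gradients of G

    # Forward the fakes through the *updated* D; no detach here,
    # because gradients must flow back into G.
    pred = netD(fake_batch).view(-1)

    # Even though the images are fake, the target is 1: G wants D to say "real"
    loss_G = criterion(pred, torch.ones(b, device=device))
    loss_G.backward()

    optimizerG.step()                                 # update G's weights
    return loss_G
```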
The optional code for generating images when noise is fed into the Generator will be discussed in Part 2.
Conclusion
In a nutshell, the whole point of training a GAN is to obtain a Generator network (with the most optimal model weights, layers, etc.) that is excellent at spewing out fakes that look real. After we do so, we can feed a point from the latent space (say, a 100-dimensional vector drawn from a Gaussian distribution) into the Generator, and it is only our Generator that knows how to convert that random noise vector into a realistic-enough image that looks like it could belong to our training set!
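In code, that final step is just a forward pass of fresh noise through the trained Generator; here G and the latent size 100 refer to the illustrative generator sketched earlier (or any trained generator of the same shape).

```python
import torch

# Sample new images from a trained generator G (as sketched earlier).
with torch.no_grad():                  # no gradients needed at inference time
    z = torch.randn(64, 100, 1, 1)     # 64 points from the latent space
    samples = G(z)                     # 64 generated images of shape 3x32x32
print(samples.shape)                   # torch.Size([64, 3, 32, 32])
```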
Until then 🙂
This article was originally published on Medium and re-published to TOPBOTS with permission from the author.