Likewise, we admire the story of musicians, artists, writers and every creative human because of their personal struggles, how they overcome life’s challenges and find inspiration from everything they’ve been through. That’s the true nature of human art. That’s something that can’t be automated, even if we achieve the always-elusive general artificial intelligence. — Ben Dickson, TechTalks
This article is a tutorial on using neural style transfer (NST) to generate professional-looking artwork like the one above. NST has been around for a while, and there are websites that will perform all of the work for you; however, it is a lot of fun to play around and create your own images. NST is quite computationally intensive, so you are limited not by your imagination but primarily by your computational resources.
All the code used in this article is available in a Jupyter notebook on my Neural Networks GitHub page. By the end of this article, you will have all the resources necessary to generate your own work using any images.
Introduction
Neural style transfer (NST) can be summarized as the following:
Artistic generation of high perceptual quality images that combines the style or texture of some input image, and the elements or content from a different one.
From the above definition, it becomes clear that producing an image with NST requires two separate images. The first image is the one whose style we wish to transfer; this could be a famous painting, such as “The Great Wave off Kanagawa” used in the first image we saw. We then take our second image and transform it using the style of the first, morphing the two together. This is illustrated in the images below, where image A is the original image of a riverside town and image B is the result after the style transfer (with the style image shown in the bottom left).
This tutorial will explain the procedure in sufficient detail to understand what is happening under the hood. To do this, we will first have to look at some other aspects of convolutional neural networks. The topics that will be discussed are:
- Visualizing convolutional networks
- Image reconstruction
- Texture synthesis
- Neural style transfer
- Code implementation
- DeepDream
Time to get started.
Visualizing Convolutional Networks
Why would we want to visualize convolutional neural networks? The main reason is that neural networks offer us little insight into what they learn and how they operate internally.
Through visualization we may:
- Observe how input stimuli excite the individual feature maps.
- Observe the evolution of features during training.
- Make more substantiated architecture design decisions.
We will be using an architecture similar to that of AlexNet [2] to explain NST in this article.
The details are outlined in “Visualizing and understanding convolutional networks” [3]. The network is trained on the ImageNet 2012 training database for 1000 classes. The input is images of size 256 x 256 x 3, and the network uses convolutional layers and max-pooling layers, with fully connected layers at the end.
For visualization, the authors employ a deconvolutional network [4]. The objective of this is to project hidden feature maps back into the original input space. This allows us to visualize the activations of a specific filter. The name “deconvolutional” network may be unfortunate, since the network does not perform any deconvolutions.
Deconvolutional Network Description
There are several aspects to this deconvolutional network: unpooling, rectification, and filtering.
Unpooling
- The max-pooling operation is non-invertible.
- Switch variables record the locations of the maxima during pooling.
- Unpooling places the reconstructed features back into these recorded locations (a small NumPy sketch of this appears after the filtering notes below).
Rectification — Signals go through a ReLU operation.
Filtering — Use of transposed convolution.
- Filters are flipped horizontally and vertically.
- Transposed convolution projects feature maps back to input space.
- Transposed convolution corresponds to the backpropagation of the gradient (an analogy from MLPs).
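To make the unpooling step concrete, here is a small NumPy sketch of my own (not code from [3] or [4]) of 2×2 max-pooling that records switch variables, and of the unpooling step that places each value back at its recorded location. The window size and the single-channel feature map are arbitrary simplifications.

import numpy as np

def max_pool_with_switches(fmap, size=2):
    """Max-pool a single 2-D feature map, recording the 'switch' location of each maximum."""
    H, W = fmap.shape
    pooled = np.zeros((H // size, W // size))
    switches = np.zeros_like(fmap, dtype=bool)
    for i in range(0, H, size):
        for j in range(0, W, size):
            window = fmap[i:i+size, j:j+size]
            pooled[i // size, j // size] = window.max()
            r, c = np.unravel_index(window.argmax(), window.shape)
            switches[i + r, j + c] = True   # remember where the maximum came from
    return pooled, switches

def unpool(pooled, switches, size=2):
    """Approximate inverse of max-pooling: place each pooled value back at its recorded location."""
    out = np.zeros(switches.shape)
    H, W = switches.shape
    for i in range(0, H, size):
        for j in range(0, W, size):
            window_mask = switches[i:i+size, j:j+size]
            out[i:i+size, j:j+size][window_mask] = pooled[i // size, j // size]
    return out

fmap = np.random.randn(4, 4)
pooled, switches = max_pool_with_switches(fmap)
reconstructed = unpool(pooled, switches)   # non-maximum positions stay zero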
How do we perform feature visualization?
(1) Evaluate the validation database on the trained network.
(2) Record the nine highest activation values of each filter’s output.
(3) Project these nine recorded outputs back into input space for every neuron.
- When projecting, all other activation units in the given layer are set to zero.
- This operation ensures we only observe the gradient of a single channel.
- Switch variables are used in the unpooling layers.
We can now look at the output of the layers of AlexNet using this technique.
We can see from the above images that the earlier layers learn more fundamental features such as lines and shapes, while the later layers learn more complex features.
How do we test feature evolution during training?
We can look at the feature evolution after 1, 2, 5, 10, 20, 30, 40 and 64 epochs for each of the five layers.
There are a few things we can note about the network:
- Lower layers converge after only a few epochs.
- The fifth layer does not converge until a very large number of epochs.
- Lower layers may change their feature correspondence after converging.
How do we know this is the best architecture?
We can perform an architecture comparison, where we simply train two different architectures and see which one does better. We can do this by checking whether the different architectures respond similarly, or more strongly, to the same inputs.
We can see in the above image evidence of fewer dead units in the modified (left) network, as well as more well-defined features, whereas AlexNet exhibits more aliasing effects.
Image reconstruction
This is where things get a bit more mathematically involved. Understanding this material is necessary if you want to know the inner workings of NST; if not, feel free to skip this section. This section follows the explanations given in “Understanding deep image representations by inverting them” [5].
We are able to reconstruct an image from its latent features. The layers of a trained network retain an accurate photographic representation of the image, along with geometric and photometric invariance. To make the notation clear, a[l] below corresponds to the latent representation at layer l. Our job is to solve an optimization problem over the image pixels.
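In this notation, with C the image we want to reconstruct and y the image being optimized, the objective of [5] takes the form:

$$\hat{y} = \arg\min_{y} \left\| a^{[l]}(y) - a^{[l]}(C) \right\|^2$$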
We can also regularize this optimization procedure using an α-norm regularizer, as well as a total variation (TV) regularizer.
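Sketching their standard forms (the TV term is shown in the squared-difference form used in the code later in this article; [5] raises it to a power β/2), both are added to the reconstruction objective above:

$$R_{\alpha}(y) = \lambda_{\alpha} \left\| y \right\|_{\alpha}^{\alpha}, \qquad R_{TV}(y) = \lambda_{TV} \sum_{i,j} \left[ \left( y_{i,j+1} - y_{i,j} \right)^2 + \left( y_{i+1,j} - y_{i,j} \right)^2 \right]$$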
This will become clearer in the code implementation later.
The steps for image reconstruction are:
- Initialize the generated image with random noise.
- Perform a feedforward pass of the image.
- Compute the loss function at the chosen layer.
- Compute the gradients of the cost and backpropagate to input space.
- Update the generated image with a gradient step (a minimal sketch of this loop follows the list).
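The following is a minimal sketch of this loop, assuming the same Keras/TensorFlow 1.x setup used in the implementation later in the article. The layer name, image path, step size and iteration count are arbitrary choices for illustration.

import numpy as np
from keras import backend as K
from keras.applications import vgg16
from keras.preprocessing.image import load_img, img_to_array

# Load a pretrained network and pick the layer whose representation we want to invert.
model = vgg16.VGG16(weights='imagenet', include_top=False)
layer_output = dict([(layer.name, layer.output) for layer in model.layers])['block3_conv1']

# Preprocess the image we want to reconstruct (the path is just an example).
img = load_img('images/inputs/chicago.jpg')
content = vgg16.preprocess_input(np.expand_dims(img_to_array(img), axis=0))

# Record the target activations of that image at the chosen layer.
get_features = K.function([model.input], [layer_output])
target = K.constant(get_features([content])[0])

# Loss: squared difference between the activations of the generated image and the target.
loss = K.sum(K.square(layer_output - target))
grads = K.gradients(loss, model.input)[0]
step = K.function([model.input], [loss, grads])

# Start from random noise and take simple gradient steps on the pixel values.
x = np.random.uniform(-20, 20, content.shape)
for i in range(200):
    loss_val, grads_val = step([x])
    x -= 0.001 * grads_val   # small, arbitrary step size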
This procedure is used to generate the example images below.
Texture Synthesis
The purpose of texture synthesis is to generate high-perceptual-quality images that imitate a given texture. This is done using a convolutional neural network trained for object classification; the correlations between the feature maps within its layers are used as the generative signal. Here is an example of texture synthesis:
The output of a given layer will look like this:
To compute the cross-correlations of the feature maps, we first denote the output of filter k at layer l and spatial position (i, j) as a with subscripts ijk and superscript [l]. The cross-correlation between this channel and a different channel k′ is obtained by summing the products of their activations over all spatial positions, and the Gram matrix collects these cross-correlations for every pair of channels.
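Written out, using the index convention just introduced, the Gram matrix entry for channels k and k′ at layer l is:

$$G^{[l]}_{kk'} = \sum_{i}\sum_{j} a^{[l]}_{ijk} \, a^{[l]}_{ijk'}$$

so G with superscript [l] is a C × C matrix, where C is the number of channels at layer l.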
Generating new textures
To create a new texture, we synthesize an image whose Gram matrices are similar to those of the texture we want to reproduce. Let G with superscripts [l] and (S) denote the Gram matrix of the style image at layer l, and G with superscripts [l] and (G) that of the newly generated image. The loss at layer l is

$$E^{[l]} = \frac{1}{\left(2 H^{[l]} W^{[l]}\right)^2} \left\| G^{[l](G)} - G^{[l](S)} \right\|_F^2,$$

where the double bars with subscript F refer to the Frobenius norm, and H and W with superscript [l] are the height and width of the feature maps at layer l. We combine all of the layer losses into a global cost function

$$L(S, G) = \sum_{l=1}^{L} \lambda_l \, E^{[l]}$$

for given weights λ1, …, λL.
Process description
Now that we know all of the details, we can illustrate the process in full:
For further details, I refer you to the paper “Texture synthesis using convolutional neural networks” [6].
Neural Style Transfer
NST was first published in the paper “A Neural Algorithm of Artistic Style” by Gatys et al., originally released on arXiv in 2015 [7].
Several mobile apps use NST techniques, including DeepArt and Prisma.
Here are some more examples of stylizations being used to transform the same image of the riverbank town that we used earlier.
Neural style transfer combines content and style reconstruction. We need to do several things to get NST to work:
- Choose a layer (or set of layers) to represent the content: the middle layers are recommended (not too shallow, not too deep) for best results.
- Minimize the total cost (written out below) by using backpropagation.
- Initialize the input with random noise (necessary for generating gradients).
- Replace max-pooling layers with average pooling to improve the gradient flow and to produce more appealing pictures.
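As a sketch of how the pieces combine, the total cost minimized in the implementation below is a weighted sum of the content, style and total-variation terms, with the weights corresponding to the content_weight, style_weights and tv_weight parameters in the code:

$$L_{total} = \alpha \, L_{content} + \sum_{l} \beta_{l} \, E^{[l]} + \gamma \, L_{TV}$$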
Code Implementation
Now for the moment you’ve all been waiting for: the code that lets you make these images yourself. For a clearer mapping between the code and the mathematical notation, please see the Jupyter notebook located in the GitHub repository.
Part 1: Import Necessary Functions
import time
import numpy as np
from keras import backend as K
from keras.applications import vgg16, vgg19
from keras.preprocessing.image import load_img
from scipy.misc import imsave
from scipy.optimize import fmin_l_bfgs_b

# preprocessing
from utils import preprocess_image, deprocess_image

%matplotlib inline
Part 2: Content Loss
We can generate an image that combines the content of one image with the style of another by minimizing a loss function that incorporates both. This is achieved with two terms: one that mimics the specific activations of a certain layer for the content image, and a second term that mimics the style. The variable to optimize is the generated image itself, which aims to minimize the proposed cost. Note that to optimize this function, we will perform gradient descent on the pixel values, rather than on the neural network weights.
We will load a trained neural network called VGG-16, proposed by Simonyan and Zisserman, which secured first and second place in the localization and classification tracks of the 2014 ImageNet Challenge, respectively. This network has been trained to discriminate among 1000 classes using more than a million images. We will use the activation values obtained for an image of interest to represent its content and style. To do so, we will feed-forward the image of interest and observe its activation values at the indicated layers.
The content loss function measures how much the feature map of the generated image differs from the feature map of the source image. We will only consider a single layer to represent the contents of an image.
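In symbols, for the chosen content layer l, the content loss is the sum of squared differences that the function below computes:

$$L_{content} = \sum_{i,j,k} \left( a^{[l]}_{ijk}(C) - a^{[l]}_{ijk}(G) \right)^2$$

where C denotes the content image and G the generated image.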
def feature_reconstruction_loss(base, output):
    """
    Compute the content loss for style transfer.

    Inputs:
    - output: features of the generated image, Tensor with shape [height, width, channels]
    - base: features of the content image, Tensor with shape [height, width, channels]

    Returns:
    - scalar content loss
    """
    return K.sum(K.square(output - base))

# Test your code
np.random.seed(1)
base = np.random.randn(10,10,3)
output = np.random.randn(10,10,3)
a = K.constant(base)
b = K.constant(output)
test = feature_reconstruction_loss(a, b)
print('Result: ', K.eval(test))
print('Expected result: ', 605.62195)
Part 3: Style Loss
The style measures the similarity among filters in a set of layers. In order to compute that similarity, we will compute the Gram matrix of the activation values for the style layers. The Gram matrix is related to the empirical covariance matrix, and therefore, reflects the statistics of the activation values.
The output is a 2-D matrix which approximately measures the cross-correlation among different filters for a given layer. This, in essence, constitutes the style of a layer.
def gram_matrix(x):
    """
    Computes the outer-product of the input tensor x.

    Input:
    - x: input tensor of shape (H, W, C)

    Returns:
    - tensor of shape (C, C) corresponding to the Gram matrix of the input image.
    """
    features = K.batch_flatten(K.permute_dimensions(x, (2, 0, 1)))
    return K.dot(features, K.transpose(features))

# Test your code
np.random.seed(1)
x_np = np.random.randn(10,10,3)
x = K.constant(x_np)
test = gram_matrix(x)
print('Result:\n', K.eval(test))
print('Expected:\n', np.array([[99.75723, -9.96186, -1.4740534],
                               [-9.96186, 86.854324, -4.141108],
                               [-1.4740534, -4.141108, 82.30106]]))
Part 4: Style Loss — Layer’s Loss
In practice, we compute the style loss at a set of layers rather than at just a single layer; the total style loss is then the weighted sum of the style losses at each layer.
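With E with superscript [l] the per-layer style loss defined in the texture-synthesis section, and w_l the weight assigned to layer l (the style_weights parameter in the code below), this sum is:

$$L_{style} = \sum_{l} w_{l} \, E^{[l]}$$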
def style_reconstruction_loss(base, output):
    """
    Computes the style reconstruction loss. It encourages the output img
    to have same stylistic features as style image.

    Inputs:
    - base: features at given layer of the style image.
    - output: features of the same length as base of the generated image.

    Returns:
    - style_loss: scalar style loss
    """
    H, W = int(base.shape[0]), int(base.shape[1])
    gram_base = gram_matrix(base)
    gram_output = gram_matrix(output)
    factor = 1.0 / float((2*H*W)**2)
    out = factor * K.sum(K.square(gram_output - gram_base))
    return out

# Test your code
np.random.seed(1)
x = np.random.randn(10,10,3)
y = np.random.randn(10,10,3)
a = K.constant(x)
b = K.constant(y)
test = style_reconstruction_loss(a, b)
print('Result: ', K.eval(test))
print('Expected:', 0.09799164)
Part 5: Total-Variation Regularizer
We will also encourage smoothness in the image using a total-variation regularizer. This penalty term will reduce variation among the neighboring pixel values.
def total_variation_loss(x):
    """
    Total variational loss. Encourages spatial smoothness in the output image.

    Inputs:
    - x: image with pixels, has shape 1 x H x W x C.

    Returns:
    - total variation loss, a scalar number.
    """
    a = K.square(x[:, :-1, :-1, :] - x[:, 1:, :-1, :])
    b = K.square(x[:, :-1, :-1, :] - x[:, :-1, 1:, :])
    return K.sum(a + b)

# Test your code
np.random.seed(1)
x_np = np.random.randn(1,10,10,3)
x = K.constant(x_np)
test = total_variation_loss(x)
print('Result: ', K.eval(test))
print('Expected:', 937.0538)
Part 6: Style Transfer
We now put it all together and generate some images! The style_transfer function below combines all the losses you coded up above and optimizes for an image that minimizes the total loss. Read the code and comments to understand the procedure.
def style_transfer(base_img_path, style_img_path, output_img_path, convnet='vgg16',
                   content_weight=3e-2, style_weights=(20000, 500, 12, 1, 1), tv_weight=5e-2,
                   content_layer='block4_conv2',
                   style_layers=['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1'],
                   iterations=50):

    print('\nInitializing Neural Style model...')

    # Determine the image sizes. Fix the output size from the content image.
    print('\n\tResizing images...')
    width, height = load_img(base_img_path).size
    new_dims = (height, width)

    # Preprocess content and style images. Resizes the style image if needed.
    content_img = K.variable(preprocess_image(base_img_path, new_dims))
    style_img = K.variable(preprocess_image(style_img_path, new_dims))

    # Create an output placeholder with desired shape.
    # It will correspond to the generated image after minimizing the loss function.
    output_img = K.placeholder((1, height, width, 3))

    # Sanity check on dimensions
    print("\tSize of content image is: {}".format(K.int_shape(content_img)))
    print("\tSize of style image is: {}".format(K.int_shape(style_img)))
    print("\tSize of output image is: {}".format(K.int_shape(output_img)))

    # Combine the 3 images into a single Keras tensor, for ease of manipulation
    # The first dimension of a tensor identifies the example/input.
    input_img = K.concatenate([content_img, style_img, output_img], axis=0)

    # Initialize the vgg16 model
    print('\tLoading {} model'.format(convnet.upper()))
    if convnet == 'vgg16':
        model = vgg16.VGG16(input_tensor=input_img, weights='imagenet', include_top=False)
    else:
        model = vgg19.VGG19(input_tensor=input_img, weights='imagenet', include_top=False)

    print('\tComputing losses...')
    # Get the symbolic outputs of each "key" layer (they have unique names).
    # The dictionary outputs an evaluation when the model is fed an input.
    outputs_dict = dict([(layer.name, layer.output) for layer in model.layers])

    # Extract features from the content layer
    content_features = outputs_dict[content_layer]

    # Extract the activations of the base image and the output image
    base_image_features = content_features[0, :, :, :]    # 0 corresponds to base
    combination_features = content_features[2, :, :, :]   # 2 corresponds to output

    # Calculate the feature reconstruction loss
    content_loss = content_weight * feature_reconstruction_loss(base_image_features, combination_features)

    # For each style layer compute style loss
    # The total style loss is the weighted sum of those losses
    temp_style_loss = K.variable(0.0)  # we update this variable in the loop
    weight = 1.0 / float(len(style_layers))

    for i, layer in enumerate(style_layers):
        # extract features of given layer
        style_features = outputs_dict[layer]
        # from those features, extract style and output activations
        style_image_features = style_features[1, :, :, :]   # 1 corresponds to style image
        output_style_features = style_features[2, :, :, :]  # 2 corresponds to generated image
        temp_style_loss += style_weights[i] * weight * \
            style_reconstruction_loss(style_image_features, output_style_features)
    style_loss = temp_style_loss

    # Compute total variational loss.
    tv_loss = tv_weight * total_variation_loss(output_img)

    # Composite loss
    total_loss = content_loss + style_loss + tv_loss

    # Compute gradients of output img with respect to total_loss
    print('\tComputing gradients...')
    grads = K.gradients(total_loss, output_img)

    outputs = [total_loss] + grads
    loss_and_grads = K.function([output_img], outputs)

    # Initialize the generated image from random noise
    x = np.random.uniform(0, 255, (1, height, width, 3)) - 128.

    # Loss function that takes a vectorized input image, for the solver
    def loss(x):
        x = x.reshape((1, height, width, 3))   # reshape
        return loss_and_grads([x])[0]

    # Gradient function that takes a vectorized input image, for the solver
    def grads(x):
        x = x.reshape((1, height, width, 3))   # reshape
        return loss_and_grads([x])[1].flatten().astype('float64')

    # Fit over the total iterations
    for i in range(iterations+1):
        print('\n\tIteration: {}'.format(i+1))

        toc = time.time()
        x, min_val, info = fmin_l_bfgs_b(loss, x.flatten(), fprime=grads, maxfun=20)

        # save current generated image
        if i % 10 == 0:
            img = deprocess_image(x.copy(), height, width)
            fname = output_img_path + '_at_iteration_%d.png' % (i)
            imsave(fname, img)
            print('\t\tImage saved as', fname)

        tic = time.time()
        print('\t\tLoss: {:.2e}, Time: {} seconds'.format(float(min_val), float(tic-toc)))
Part 7: Generate Pictures
Now we are ready to make some images. Run your own compositions, test out variations of the hyperparameters, and see what you can come up with; I will give you an example below. The hyperparameters to vary are as follows:
- The base_img_path is the filename of the content image.
- The style_img_path is the filename of the style image.
- The output_img_path is the filename of the generated image.
- The convnet selects which set of pretrained network weights to use, VGG-16 or VGG-19.
- The content_layer specifies which layer to use for content loss.
- The content_weight weights the content loss in the overall composite loss function. Increasing the value of this parameter will make the final image look more realistic (closer to the original content).
- style_layers specifies a list of which layers to use for the style loss.
- style_weights specifies a list of weights to use for each layer in style_layers (each of which will contribute a term to the overall style loss). We generally use higher weights for the earlier style layers because they describe more local/smaller-scale features, which are more important to texture than features over larger receptive fields. In general, increasing these weights will make the resulting image look less like the original content and more distorted towards the appearance of the style image.
- tv_weight specifies the weighting of total variation regularization in the overall loss function. Increasing this value makes the resulting image look smoother and less jagged, at the cost of lower fidelity to style and content.
The following code will generate the front image of this article if run for 50 iterations.
params = {
    'base_img_path'   : 'images/inputs/chicago.jpg',
    'style_img_path'  : 'images/inputs/great_wave_of_kanagawa.jpg',
    'output_img_path' : 'images/results/wave_chicago',
    'convnet'         : 'vgg16',
    'content_weight'  : 500,
    'style_weights'   : (10, 10, 50, 10, 10),
    'tv_weight'       : 200,
    'content_layer'   : 'block4_conv2',
    'style_layers'    : ['block1_conv1', 'block2_conv1', 'block3_conv1', 'block4_conv1', 'block5_conv1'],
    'iterations'      : 50
}

style_transfer(**params)
Here are a couple of rough examples from my own implementation after 50 iterations:
I recommend taking some of the images in the GitHub repository (or your own) and playing around with the hyperparameters and seeing what images you can make. However, to warn you, the training times are quite high unless you have access to a GPU, possibly taking several hours for one image.
DeepDream
DeepDream is a computer vision program created by Google engineer Alexander Mordvintsev that uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, creating a dream-like, hallucinogenic appearance in the deliberately over-processed images.
Google’s program popularized the term (deep) “dreaming” to refer to the generation of images that produce desired activations in a trained deep network, and the term now refers to a collection of related approaches.
Here is an example of an image transformed by DeepDream.
Inceptionism: Going Deeper into Neural Networks
We already have a reasonable intuition about what types of features are encapsulated by each of the layers in a neural network:
- The first layer perhaps looks for simple features such as edges or corners.
- Intermediate layers interpret the basic features to look for overall shapes or components, like a door or a leaf.
- Final layers assemble those into complete interpretations: trees, buildings, etc.
This works fine for discriminative models, but what if we want to build a generative model? Say, for example, that you want to know what kind of image the network considers to be a banana. One way to do this is to turn the neural network upside down: start with an image full of random noise, and then gradually tweak the image towards what the neural net considers a banana.
By itself, this does not work particularly well, but if we impose a prior constraint that the image should have similar characteristics to natural images, such as a correlation between neighboring pixels, it becomes much more feasible.
Perhaps not surprisingly, neural networks trained to discriminate between different image classes have a substantial amount of information that is needed to generate images too.
This can be leveraged for the purpose of class generation, essentially flipping the discriminative model into a generative model. But why would we do this?
Well, let’s say you train a neural network to classify forks. You take thousands of images of forks and use them to train the network, and the network performs pretty well on data — but what is the network doing? What is the network using as its representation of what a fork is? This can be useful to ensure that the network is learning the right features and not cheating.
Visualizing mistakes
A good example of this cheating involves dumbbells. After training a network on a set of pictures of dumbbells, we use some random noise with prior constraints to “imagine” some dumbbells and see what pops out. Here is the result:
As we can see, the network always generates dumbbells attached to an arm. The network has failed to completely distill the essence of a dumbbell: it seems that none of its training pictures showed a dumbbell without a weightlifter's arm holding it. Visualization can help us catch and correct these kinds of training mishaps.
Enhancing feature maps
Instead of prescribing which feature we want the network to amplify, we can also let the network make that decision.
This can be done by feeding the network an image, and then picking a layer and asking the network to enhance whatever it detected. Lower layers tend to produce strokes or simple ornament-like patterns, such as this:
With higher-level layers, which identify more sophisticated features, complex patterns or even whole objects tend to emerge. The process creates a feedback loop: if a cloud looks a little bit like a bird, the network will make it look more like a bird.
If the network was trained on pictures of animals, it will make the cloud look more like an animal:
The results vary quite a bit with the kind of image, because the features present in the input image bias the network towards certain interpretations. For example, horizon lines tend to get filled with towers and pagodas, rocks and trees turn into buildings, and birds and insects appear in images of leaves.
If we apply the algorithm iteratively on its own outputs and apply some zooming after each iteration, we get an endless stream of new impressions, exploring the set of things the network knows about.
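To make the idea concrete, here is a minimal sketch of this “enhance whatever you detected” loop, written in the same Keras setup used earlier in the article. It leaves out the multi-scale (octave) processing, jitter and zooming that the real DeepDream uses, and the layer name, image path and step size are arbitrary choices for illustration.

import numpy as np
from keras import backend as K
from keras.applications import vgg16
from keras.preprocessing.image import load_img, img_to_array

# Load a pretrained network and pick a layer to enhance.
model = vgg16.VGG16(weights='imagenet', include_top=False)
dream_layer = dict([(layer.name, layer.output) for layer in model.layers])['block4_conv1']

# Objective: the mean squared activation of the chosen layer.
# Gradient ascent on the pixels amplifies whatever the layer already detects in the image.
objective = K.mean(K.square(dream_layer))
grads = K.gradients(objective, model.input)[0]
grads /= K.maximum(K.mean(K.abs(grads)), K.epsilon())   # normalize the gradient
fetch = K.function([model.input], [objective, grads])

# Start from a real photograph rather than random noise.
img = load_img('images/inputs/chicago.jpg')
x = vgg16.preprocess_input(np.expand_dims(img_to_array(img), axis=0))

# Take a few gradient-ascent steps on the image.
for i in range(20):
    obj_val, grads_val = fetch([x])
    x += 1.0 * grads_val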
DeepDream is a fascinating project, and I encourage the reader to look deeper (pardon the pun) into it if they are intrigued.
Final Comments
I hope you enjoyed the neural style transfer article and learned something new about style transfer, convolutional neural networks, or perhaps just enjoyed seeing the fascinating pictures generated by the deep neural networks of DeepDream.
References
[1] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, “A neural algorithm of artistic style,” Aug. 2015.
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[3] Matthew D. Zeiler and Rob Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision (ECCV), Springer, 2014, pp. 818–833.
[4] Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus, “Adaptive deconvolutional networks for mid and high-level feature learning,” in IEEE International Conference on Computer Vision (ICCV), 2011, pp. 2018–2025.
[5] Aravindh Mahendran and Andrea Vedaldi, “Understanding deep image representations by inverting them,” Nov. 2014.
[6] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, “Texture synthesis using convolutional neural networks,” in Advances in Neural Information Processing Systems, 2015.
[7] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, “A Neural Algorithm of Artistic Style,” 26 August 2015 (same work as [1]).
This article was originally published on Towards Data Science and re-published to TOPBOTS with permission from the author.