This research summary is part of our AI for Marketing series, which covers the latest AI and machine learning approaches to key aspects of marketing automation:
- Attribution
- Optimization
- Personalization
- Analytics
- Content Generation: Images
- Content Generation: Videos
- Content Generation: Text
After Generative Adversarial Networks (GANs) were introduced by Ian Goodfellow in 2014, a whole new era of AI image synthesis began. Starting with small, blurry, black-and-white pictures, GANs have made tremendous progress over the last five years, and the latest GAN-based systems generate high-resolution, realistic, colorful pictures that are hardly distinguishable from real photographs.
Feeling skeptical? Then check out these impressive pictures of people who do not actually exist.
So why not use these awesome GAN capabilities to generate original marketing content? To help you navigate the numerous GAN-related research advances of the last few years, we’ve summarized several important research papers that demonstrate how AI image synthesis can be applied in marketing.
If these accessible AI research analyses and summaries are useful to you, you can subscribe to receive our regular industry updates below.
If you’d like to skip around, here are the papers we featured:
- LoGAN: Generating Logos with a Generative Adversarial Neural Network Conditioned on Color
- Unsupervised Person Image Generation with Semantic Parsing Transformation
- Persuasive Faces: Generating Faces in Advertisements
- SC-FEGAN: Face Editing Generative Adversarial Network with User’s Sketch and Color
- MirrorGAN: Learning Text-to-image Generation by Redescription
- High-Fidelity Image Generation With Fewer Labels
- Enabling Hyper-Personalisation: Automated Ad Creative Generation and Ranking for Fashion e-Commerce
Important Image Generation Research Papers
1. LoGAN: Generating Logos with a Generative Adversarial Neural Network Conditioned on Color by Ajkel Mino and Gerasimos Spanakis
Original Abstract
Designing a logo is a long, complicated, and expensive process for any designer. However, recent advancements in generative algorithms provide models that could offer a possible solution. Logos are multi-modal, have very few categorical properties, and do not have a continuous latent space. Yet, conditional generative adversarial networks can be used to generate logos that could help designers in their creative process. We propose LoGAN: an improved auxiliary classifier Wasserstein generative adversarial neural network (with gradient penalty) that is able to generate logos conditioned on twelve different colors. In 768 generated instances (12 classes and 64 logos per class), when looking at the most prominent color, the conditional generation part of the model has an overall precision and recall of 0.8 and 0.7 respectively. LoGAN’s results offer a first glance at how artificial intelligence can be used to assist designers in their creative process and open promising future directions, such as including more descriptive labels which will provide a more exhaustive and easy-to-use system.
Our Summary
The researchers from Maastricht University introduce a novel approach to logo generation called LoGAN. The approach is based on a generative adversarial network (GAN), namely an Auxiliary Classifier Wasserstein GAN that enables conditioning the generated images on certain labels. In this paper, logo generation is conditioned on twelve different colors. The experiments demonstrate that the generated logos resemble plausible real-world logos. However, they have very low resolution and can serve only as a very rough first draft of a final logo.
What’s the core idea of this paper?
- The task of logo generation is challenging because logos are multi-modal, their latent space is not continuous, and they don’t contain a hierarchy of nested segments that networks can learn and reproduce.
- One solution is to generate logos conditioned on certain labels, and in this paper, the researchers suggest conditioning logo generation on twelve colors. Color is an informative label and gives designers a certain degree of flexibility.
- The proposed model, called LoGAN, is an Auxiliary Classifier Wasserstein Generative Adversarial Neural Network with gradient penalty (AC-WGAN-GP):
- It consists of three neural networks, namely the discriminator, the generator, and a classification network.
- An additional classification network assists the discriminator in classifying the logos, compensating for the Wasserstein distance dominating the classification loss of the original AC-GAN.
- To avoid instability and mode collapse, the generator and classifier are trained once for every five iterations of discriminator training (see the training-loop sketch below).
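To make the training schedule above concrete, here is a minimal PyTorch sketch of an AC-WGAN-GP training step with five critic updates per generator/classifier update. The tiny MLP networks, optimizer settings, and the exact way the auxiliary classification loss enters the objective are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn

Z_DIM, N_CLASSES, IMG = 100, 12, 3 * 32 * 32   # 12 color labels, 32x32 RGB logos

# Tiny stand-in networks (the paper uses conv nets; MLPs keep the sketch short).
G = nn.Sequential(nn.Linear(Z_DIM + N_CLASSES, 256), nn.ReLU(), nn.Linear(256, IMG), nn.Tanh())
D = nn.Sequential(nn.Linear(IMG, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))          # critic
C = nn.Sequential(nn.Linear(IMG, 256), nn.LeakyReLU(0.2), nn.Linear(256, N_CLASSES))  # aux classifier

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.9))
opt_gc = torch.optim.Adam(list(G.parameters()) + list(C.parameters()), lr=1e-4, betas=(0.5, 0.9))
ce = nn.CrossEntropyLoss()

def gradient_penalty(real, fake):
    eps = torch.rand(real.size(0), 1)
    mix = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(mix).sum(), mix, create_graph=True)[0]
    return ((grad.norm(2, dim=1) - 1) ** 2).mean()

def sample_fake(labels):
    z = torch.randn(labels.size(0), Z_DIM)
    return G(torch.cat([z, torch.eye(N_CLASSES)[labels]], dim=1))   # condition on the color label

def train_step(real_imgs, labels, n_critic=5, gp_weight=10.0):
    # The critic is updated n_critic times per generator/classifier update for stability.
    for _ in range(n_critic):
        fake = sample_fake(labels).detach()
        d_loss = D(fake).mean() - D(real_imgs).mean() + gp_weight * gradient_penalty(real_imgs, fake)
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    fake = sample_fake(labels)
    g_loss = -D(fake).mean() + ce(C(fake), labels) + ce(C(real_imgs), labels)
    opt_gc.zero_grad(); g_loss.backward(); opt_gc.step()

# Dummy batch: 8 flattened logos in [-1, 1] with random color labels.
train_step(torch.rand(8, IMG) * 2 - 1, torch.randint(0, N_CLASSES, (8,)))
```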
What’s the key achievement?
- The proposed model generates logos that resemble the real ones:
- even though the generated images are dominated by round and square shapes, irregular shapes have also been generated (e.g., the heart and the x among the white logos);
- as the training images are only 32 x 32 pixels, the generated logos are blurred.
What does the AI community think?
- The paper was presented at ICMLA 2018, the 17th IEEE International Conference on Machine Learning and Applications.
What are future research areas?
- Conditioning on more labels such as:
- the shape of the logo;
- the most used words to describe the company that the logo belongs to.
- Introducing the semantic meaning to the logos by combining labels that define the focus of the company with word embedding models.
What are possible business applications?
- LoGAN can assist designers in their creative process for a logo design:
- the generated images are of very low resolution but they can serve as a very rough first draft or just as a means of inspiration for the designer;
- in the paper, logos are generated conditioned on twelve colors, but the proposed model could also be used to create logos given a certain keyword.
Where can you get implementation code?
- The authors provide an implementation of LoGAN on GitHub.
2. Unsupervised Person Image Generation with Semantic Parsing Transformation by Sijie Song, Wei Zhang, Jiaying Liu, Tao Mei
Original Abstract
In this paper, we address unsupervised pose-guided person image generation, which is known challenging due to non-rigid deformation. Unlike previous methods learning a rock-hard direct mapping between human bodies, we propose a new pathway to decompose the hard mapping into two more accessible subtasks, namely, semantic parsing transformation and appearance generation. Firstly, a semantic generative network is proposed to transform between semantic parsing maps, in order to simplify the non-rigid deformation learning. Secondly, an appearance generative network learns to synthesize semantic-aware textures. Thirdly, we demonstrate that training our framework in an end-to-end manner further refines the semantic maps and final results accordingly. Our method is generalizable to other semantic-aware person image generation tasks, eg, clothing texture transfer and controlled image manipulation. Experimental results demonstrate the superiority of our method on DeepFashion and Market-1501 datasets, especially in keeping the clothing attributes and better body shapes.
Our Summary
The research paper introduces a new approach to unsupervised person image generation. To deal with the complexity of learning a direct mapping under different poses, the researchers suggest decomposing the hard task of pose-guided image generation into two steps, namely semantic parsing transformation and appearance generation. The qualitative and quantitative evaluation of the introduced approach demonstrates that this model outperforms other state-of-the-art approaches in rendering better body shape and keeping clothing attributes. Moreover, this approach is generalizable to other conditional image generation tasks such as clothing texture transfer and controlled image manipulation.
What’s the core idea of this paper?
- The proposed model for unsupervised person image generation consists of two modules:
- semantic parsing transformation module with the semantic generation network employed for transforming the input semantic parsing to the target parsing, according to the target pose;
- appearance generation module with the generative network for synthesizing textures on the transformed parsing.
- The model is trained in an end-to-end manner for better semantic map prediction and improved final results (a minimal sketch of the two-stage forward pass follows this list).
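The PyTorch sketch below illustrates the two-stage decomposition: a semantic network predicts the target parsing from the source parsing and the target pose, and an appearance network then renders textures on it. The small convolutional stacks, channel counts, and input sizes are placeholders chosen for brevity, not the architectures from the paper.

```python
import torch
import torch.nn as nn

class TwoStagePersonGenerator(nn.Module):
    """Sketch of the decomposition: parsing transformation, then appearance synthesis."""

    def __init__(self, n_parts=10, pose_ch=18, img_ch=3):
        super().__init__()
        # Stage 1: predict the target semantic parsing from source parsing + target pose heatmaps.
        self.semantic_net = nn.Sequential(
            nn.Conv2d(n_parts + pose_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, n_parts, 3, padding=1))
        # Stage 2: render textures on the predicted parsing, guided by the source image.
        self.appearance_net = nn.Sequential(
            nn.Conv2d(img_ch + 2 * n_parts, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, img_ch, 3, padding=1), nn.Tanh())

    def forward(self, src_img, src_parsing, tgt_pose):
        tgt_parsing = self.semantic_net(torch.cat([src_parsing, tgt_pose], dim=1)).softmax(dim=1)
        out = self.appearance_net(torch.cat([src_img, src_parsing, tgt_parsing], dim=1))
        return out, tgt_parsing

# Usage with assumed sizes: 128x128 inputs, 10 body-part channels, 18 pose-keypoint heatmaps.
model = TwoStagePersonGenerator()
img, parsing, pose = torch.randn(1, 3, 128, 128), torch.rand(1, 10, 128, 128), torch.rand(1, 18, 128, 128)
generated, predicted_parsing = model(img, parsing, pose)
```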
What’s the key achievement?
- Qualitative comparison with several state-of-the-art models demonstrates that the proposed approach generates more realistic images with higher visual quality and fewer artifacts. The approach performs especially well with respect to clothing attributes such as textures and clothing type.
- Quantitative results are in line with the results of the qualitative comparison and show that the introduced model achieves the best Inception Score (IS) value on both DeepFashion and Market-1501 datasets, even compared with supervised methods.
What does the AI community think?
- The paper was accepted for oral presentation at CVPR 2019, the premier computer vision conference.
What are future research areas?
- Training the human parser and person image generation model jointly.
What are possible business applications?
- The introduced approach to pose-guided image generation can have a number of possible applications in the fashion industry and e-commerce business:
- generating an image that follows the clothing appearance of the condition image but in a different pose;
- transferring clothing texture given the condition and target images;
- performing controlled image manipulation (e.g., editing the sleeve length, changing the dress to pants, etc).
Where can you get implementation code?
- The researchers provide a PyTorch implementation of the introduced approach to unsupervised person image generation.
3. Persuasive Faces: Generating Faces in Advertisements by Christopher Thomas and Adriana Kovashka
Original Abstract
In this paper, we examine the visual variability of objects across different ad categories, i.e. what causes an advertisement to be visually persuasive. We focus on modeling and generating faces which appear to come from different types of ads. For example, if faces in beauty ads tend to be women wearing lipstick, a generative model should portray this distinct visual appearance. Training generative models which capture such category-specific differences is challenging because of the highly diverse appearance of faces in ads and the relatively limited amount of available training data. To address these problems, we propose a conditional variational autoencoder which makes use of predicted semantic attributes and facial expressions as a supervisory signal when training. We show how our model can be used to produce visually distinct faces which appear to be from a fixed ad topic category. Our human studies and quantitative and qualitative experiments confirm that our method greatly outperforms a variety of baselines, including two variations of a state-of-the-art generative adversarial network, for transforming faces to be more ad-category appropriate. Finally, we show preliminary generation results for other types of objects, conditioned on an ad topic.
Our Summary
The research team from the University of Pittsburgh draws our attention to the fact that faces in advertisements vary significantly across ad types – for example, a female face in a beauty ad looks vastly different from a face used in a social campaign on domestic violence. The researchers therefore introduce a conditional variational autoencoder that can produce visually distinct faces for fixed ad topic categories. Qualitative and quantitative evaluations demonstrate that the proposed approach outperforms several strong baselines in transforming faces to be appropriate for specific ad categories.
What’s the core idea of this paper?
- The paper introduces a novel generative approach for modifying the appearances of faces depending on the ad category:
- the approach leverages semantics learned on larger datasets;
- thus, the model learns how faces differ in terms of predicted attributes and facial expressions instead of relying on differences at the pixel level.
- The introduced method includes the following steps:
- training facial expression and facial attribute classifiers on the existing datasets;
- detecting faces in ads and predicting their attributes and expressions;
- training a conditional variational autoencoder on the resulting dataset of ad faces so that it learns how to reconstruct an ad face from a vector comprised of a learned latent representation, facial attributes, and facial expressions;
- embedding all ad faces into vector space and computing how faces differ across ad topics;
- using these learned differences to transform embeddings of other ad faces into each ad category;
- using a decoder to generate distinct faces across ad categories (a minimal sketch of the conditioning and category-transfer steps follows this list).
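The PyTorch sketch below illustrates the core mechanism: a conditional VAE that decodes a face from a latent vector concatenated with predicted attributes and expressions, plus the embedding shift that moves a face toward another ad category. The layer sizes, the 40-attribute/7-expression conditioning dimensions, and the mean-difference transfer are illustrative assumptions, not the authors’ exact model.

```python
import torch
import torch.nn as nn

class ConditionalFaceVAE(nn.Module):
    """Minimal conditional VAE sketch: decode a face from [latent, attributes, expressions]."""

    def __init__(self, img_dim=64 * 64 * 3, cond_dim=40 + 7, z_dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU())
        self.to_mu, self.to_logvar = nn.Linear(512, z_dim), nn.Linear(512, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + cond_dim, 512), nn.ReLU(),
                                 nn.Linear(512, img_dim), nn.Sigmoid())

    def forward(self, x, cond):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(torch.cat([z, cond], dim=1)), mu, logvar

def transform_to_category(face_embedding, source_mean, target_mean):
    """Shift a face embedding by the difference between ad-topic category means
    (e.g. from 'beauty' to 'domestic violence'), then decode it to get the transformed face."""
    return face_embedding + (target_mean - source_mean)

# Usage with dummy data: two flattened 64x64 face crops and their predicted attribute vectors.
vae = ConditionalFaceVAE()
recon, mu, logvar = vae(torch.rand(2, 64 * 64 * 3), torch.rand(2, 47))
```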
What’s the key achievement?
- Comparison with strong baselines, including the architectures based on autoencoders and GANs, shows that the proposed method is the most faithful in transferring the topic-specific facial appearance.
- When asked which method best portrays the distinct visual appearance of faces across the five ad topics, human judges selected the introduced approach four times as often as the next best method.
- The experiments also show that this model can be used to generate other objects besides faces as they appear in different ad categories (e.g., bottles from beauty, alcohol, and soda advertisements).
What does the AI community think?
- The paper was presented at the British Machine Vision Conference (BMVC 2018).
What are future research areas?
- Investigating techniques for reliably generating other objects besides faces, and for creating complete ads with multiple objects and persuasive slogans.
What are possible business applications?
- The proposed approach can be the first step to automated ad generation with the visual appearance of all objects adapted to the specific ad category.
4. SC-FEGAN: Face Editing Generative Adversarial Network with User’s Sketch and Color by Youngjoo Jo and Jongyoul Park
Original Abstract
We present a novel image editing system that generates images as the user provides free-form mask, sketch and color as an input. Our system consists of an end-to-end trainable convolutional network. Contrary to the existing methods, our system wholly utilizes free-form user input with color and shape. This allows the system to respond to the user’s sketch and color input, using it as a guideline to generate an image. In our particular work, we trained network with additional style loss which made it possible to generate realistic results, despite large portions of the image being removed. Our proposed network architecture SC-FEGAN is well suited to generate high-quality synthetic image using intuitive user inputs.
Our Summary
In this paper, the researchers propose SC-FEGAN, a neural-network-based face image editing system that generates high-quality synthetic images with realistic texture details from free-form input. This approach enables automated image editing, so you can easily add, change, or remove glasses or earrings, change hairstyles, face shapes, eyes, mouth, etc. Furthermore, the SC-FEGAN model can also be used to restore faces even when large regions are erased, or to generate a face image from sketch and color input alone. The experiments demonstrate that the introduced method outperforms several strong baselines in generating high-quality images from free-form input.
What’s the core idea of this paper?
- The research team presents SC-FEGAN, a new image editing system that takes free-form masks, sketches, and colors as inputs. The system consists of:
- a generator based on an encoder-decoder architecture similar to U-Net, and
- a discriminator based on SN-PatchGAN.
- In addition to the usual GAN loss, the system is trained with an additional style loss, which allows it to edit parts of the face image even when a large area is missing. The style loss also enables adding details such as high-quality synthetic hairstyles or earrings (a minimal sketch of such a style loss is given below).
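As an illustration of the style-loss idea, here is a minimal PyTorch sketch that matches Gram matrices of VGG features between the generated and ground-truth images. The choice of VGG-16 layers, the L1 distance, and the untrained weights (used here to keep the snippet offline) are assumptions made for brevity, not the exact formulation from the paper.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Fixed feature extractor; in practice pretrained weights (e.g. ImageNet) would be used.
vgg = models.vgg16(weights=None).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram_matrix(feat):
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)    # channel-by-channel feature correlations

def style_loss(generated, target):
    """Penalize differences in feature statistics, which encourages plausible texture
    (hair, earrings) inside large erased regions rather than blurry averages."""
    return F.l1_loss(gram_matrix(vgg(generated)), gram_matrix(vgg(target)))

loss = style_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```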
What’s the key achievement?
- SC-FEGAN outperforms other state-of-the-art approaches in generating high-quality and realistic face images using free-form input from users:
- The experiments demonstrate a variety of successful and realistic editing results, even in very challenging cases.
- The system is excellent at modifying and restoring large regions in one pass.
- The introduced approach requires minimal effort from users.
What does the AI community think?
- The GitHub repository with the source code for the SC-FEGAN model has gained considerable popularity among machine learning practitioners, with over 2,500 stars from GitHub users.
What are future research areas?
- Exploring the ways to further improve the quality and realism of generated images, especially in the case of generating a new image based on the sketch and color input only.
What are possible business applications?
- The introduced system can be of great help for ad designers as it:
- produces high-quality and realistic face images;
- accepts intuitive inputs such as sketching and coloring;
- automates image editing;
- requires minimal effort from users.
Where can you get implementation code?
- The TensorFlow implementation of the SC-FEGAN model, the detailed instructions on using this system, and some additional experimental results can be found on GitHub.
5. MirrorGAN: Learning Text-to-image Generation by Redescription, by Tingting Qiao, Jing Zhang, Duanqing Xu, Dacheng Tao
Original Abstract
Generating an image from a given text description has two goals: visual realism and semantic consistency. Although significant progress has been made in generating high-quality and visually realistic images using generative adversarial networks, guaranteeing semantic consistency between the text description and visual content remains very challenging. In this paper, we address this problem by proposing a novel global-local attentive and semantic-preserving text-to-image-to-text framework called MirrorGAN. MirrorGAN exploits the idea of learning text-to-image generation by redescription and consists of three modules: a semantic text embedding module (STEM), a global-local collaborative attentive module for cascaded image generation (GLAM), and a semantic text regeneration and alignment module (STREAM). STEM generates word- and sentence-level embeddings. GLAM has a cascaded architecture for generating target images from coarse to fine scales, leveraging both local word attention and global sentence attention to progressively enhance the diversity and semantic consistency of the generated images. STREAM seeks to regenerate the text description from the generated image, which semantically aligns with the given text description. Thorough experiments on two public benchmark datasets demonstrate the superiority of MirrorGAN over other representative state-of-the-art methods.
Our Summary
In this paper, the authors address the problem of generating realistic images that match a given text description. They introduce a novel global-local attentive text-to-image-to-text framework called MirrorGAN. It exploits the idea that if the generated image is semantically consistent with a given text description, its redescription created through image-text translation should have exactly the same semantics as the original text description. Thus, in addition to visual realism adversarial loss and text-image paired semantic consistency adversarial loss, the model also includes a text-semantics reconstruction loss based on cross-entropy. The experiments on two public datasets demonstrate that MirrorGAN outperforms other representative state-of-the-art methods with respect to both visual realism and semantic consistency.
Figure: Learning text-to-image generation by redescription
What’s the core idea of this paper?
- To generate visually realistic images that are consistent with a given text description, the authors introduce a novel text-to-image-to-text framework called MirrorGAN.
- The model exploits the idea of learning text-to-image generation by redescription.
- It includes three modules:
- a semantic text embedding module (STEM) for generating word- and sentence-level embeddings;
- a global-local collaborative attentive module (GLAM) for cascaded image generation;
- a semantic text regeneration and alignment module (STREAM) for regenerating the text description from the generated image.
- The model uses two adversarial losses to ensure visual realism and text-image semantic consistency, and also employs a cross-entropy-based text-semantics reconstruction loss (sketched below).
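A minimal PyTorch sketch of how these three generator-side objectives could be combined is shown below. The discriminator outputs, the STREAM caption logits, and the `lambda_text` weight are assumed placeholders for illustration, not interfaces or values taken from the authors’ code.

```python
import torch
import torch.nn.functional as F

def mirrorgan_generator_loss(d_uncond_logits, d_cond_logits, caption_logits, caption_tokens,
                             lambda_text=20.0):
    # Visual-realism adversarial term: the unconditional discriminator should accept the image.
    adv_visual = F.binary_cross_entropy_with_logits(
        d_uncond_logits, torch.ones_like(d_uncond_logits))
    # Semantic-consistency adversarial term: the text-conditioned discriminator should accept
    # the (image, sentence embedding) pair.
    adv_semantic = F.binary_cross_entropy_with_logits(
        d_cond_logits, torch.ones_like(d_cond_logits))
    # Text-reconstruction term: STREAM re-describes the generated image, and its predicted
    # word distribution should match the original caption tokens.
    text_recon = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)), caption_tokens.reshape(-1))
    return adv_visual + adv_semantic + lambda_text * text_recon

# Usage with dummy shapes: batch of 4, caption length 12, vocabulary of 5,000 words.
loss = mirrorgan_generator_loss(
    d_uncond_logits=torch.randn(4, 1),
    d_cond_logits=torch.randn(4, 1),
    caption_logits=torch.randn(4, 12, 5000),
    caption_tokens=torch.randint(0, 5000, (4, 12)))
```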
What’s the key achievement?
- The experiments on the CUB and COCO datasets demonstrate that MirrorGAN outperforms the state-of-the-art AttnGAN by:
- improving the Inception Score from 4.36 to 4.56 on CUB and from 25.89 to 26.47 on the COCO dataset, implying higher diversity and better quality of the generated images;
- getting significantly higher R-precision scores, implying higher semantic consistency of the generated images;
- generating more convincing images, according to the results of the human perceptual test.
What does the AI community think?
- The paper was presented at CVPR 2019, the leading conference in computer vision.
What are future research areas?
- Optimizing the MirrorGAN modules jointly with complete end-to-end training.
- Employing a more advanced language model like BERT for text embedding and image captioning.
What are possible business applications?
- The introduced approach to generating realistic images from a given text description can be leveraged in advertising to automatically generate ad creatives.
Where can you get implementation code?
- The PyTorch implementation of MirrorGAN is available on GitHub.
6. High-Fidelity Image Generation With Fewer Labels, by Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, Sylvain Gelly
Original Abstract
Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on both unsupervised ImageNet synthesis, as well as in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels and outperform it using 20% of the labels.
Our Summary
The Google Research team investigates several directions for reducing the appetite of state-of-the-art Generative Adversarial Networks (GANs) for labeled data. In particular, they show that recent advances in self-supervised and semi-supervised learning can be leveraged to significantly reduce the amount of ground-truth label information required for natural image generation while still achieving state-of-the-art results. Namely, they demonstrate that a pre-trained semi-supervised approach can match the state-of-the-art performance of BigGAN using only 20% of the labels. Moreover, adding self-supervision during GAN training leads to even better performance, matching the state-of-the-art BigGAN with only 10% of the labels and outperforming it with 20% of the labels.
What’s the core idea of this paper?
- The paper investigates several avenues for reducing the need for labeled data in natural image generation:
- Pre-trained approaches:
- Unsupervised clustering-based method, where cluster assignments are used as a replacement for labels.
- Semi-supervised method, where the above-mentioned approach is extended with a semi-supervised loss.
- Co-training approaches:
- Unsupervised method, where the authors experiment with a single label assigned to all examples and random labels assigned to real images.
- Semi-supervised method, where labels for the unlabeled images are predicted based on the available labels.
- Self-supervision during GAN training, where the discriminator is augmented with an auxiliary task – self-supervision through rotation prediction (see the sketch after this list).
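The rotation-prediction auxiliary task can be sketched in a few lines of PyTorch: each image is rotated by 0, 90, 180, and 270 degrees, and the discriminator’s feature backbone plus a small linear head must classify the rotation. The stand-in backbone and head below are assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn.functional as F

def rotation_self_supervision_loss(backbone, rotation_head, images):
    """Auxiliary self-supervised loss: predict which of the four rotations was applied."""
    rotations = [torch.rot90(images, k, dims=(2, 3)) for k in range(4)]
    x = torch.cat(rotations, dim=0)                              # (4B, C, H, W)
    labels = torch.arange(4).repeat_interleave(images.size(0))   # rotation class per image
    logits = rotation_head(backbone(x))
    return F.cross_entropy(logits, labels)

# Usage with tiny stand-ins for the discriminator backbone and rotation head.
backbone = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 64))
rotation_head = torch.nn.Linear(64, 4)
loss = rotation_self_supervision_loss(backbone, rotation_head, torch.randn(8, 3, 32, 32))
```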
What’s the key achievement?
- Achieving a new state of the art in unsupervised image generation with the unsupervised clustering-based approach.
- Getting state-of-the-art performance using 20% labeled data with the pretrained semi-supervised approach.
- Matching the state-of-the-art BigGAN with only 10% of the labels and outperforming it with 20% labels by applying self-supervision during GAN training.
What does the AI community think?
- The paper was accepted for oral presentation at ICML 2019, one of the key conferences in machine learning.
What are future research areas?
- The authors suggest the following directions for future work:
- exploring the applicability of the introduced techniques for even larger and more diverse datasets than ImageNet;
- investigating the impact of other self-supervised and semi-supervised approaches on the model performance;
- investigating the impact of self-supervision in other deep generative models.
What are possible business applications?
- Considering the scarcity of labeled data in the business setting, the presented techniques can be very beneficial for companies seeking to deploy GANs for automated generation of high-quality ad creatives.
Where can you get implementation code?
- The authors open-source all the code used in their experiments on GitHub.
7. Enabling Hyper-Personalisation: Automated Ad Creative Generation and Ranking for Fashion e-Commerce, by Sreekanth Vempati, Korah T Malayil, Sruthi V, Sandeep R
Original Abstract
A homepage is the first touch point in the customer’s journey and is one of the prominent channels of revenue for many e-commerce companies. A user’s attention is mostly captured by homepage banner images (also called Ads/Creatives). The set of banners shown and their design influence the customer’s interest and play a key role in optimizing the click-through rates of the banners. Presently, massive and repetitive effort is put in, to manually create aesthetically pleasing banner images. Due to the large amount of time and effort involved in this process, only a small set of banners are made live at any point. This reduces the number of banners created as well as the degree of personalization that can be achieved. This paper thus presents a method to generate creatives automatically on a large scale in a short duration. The availability of diverse banners generated helps in improving personalization as they can cater to the taste of larger audience. The focus of our paper is on generating wide variety of homepage banners that can be made as an input for user-level personalization engine. Following are the main contributions of this paper: 1) We introduce and explain the need for large scale banner generation for e-commerce 2) We present on how we utilize existing deep learning-based detectors which can automatically annotate the required objects/tags from the image. 3) We also propose a Genetic Algorithm based method to generate an optimal banner layout for the given image content, input components and other design constraints. 4) Further, to aid the process of picking the right set of banners, we designed a ranking method and evaluated multiple models. All our experiments have been performed on data from Myntra, one of the top fashion e-commerce players in India.
Our Summary
In this paper, the research team from Myntra, one of the leading e-commerce players in India, introduces its approach to the hyper-personalization of homepage banners. Manual banner creation requires many hours spent searching image libraries, selecting font colors, sizes, and typography, transforming images, and finally combining all the elements into an aesthetically appealing banner. As a result, only a few banners are usually available, which limits the degree of personalization that can be achieved. At Myntra, the team leverages a genetic-algorithm-based method that automatically generates banners from a library of design elements. To pick the right set of banners from those generated, they use a ranking method built on banner metadata. An online A/B test demonstrates that the hyper-personalization enabled by automatic banner generation results in a 72% increase in click-through rate (CTR).
What’s the core idea of this paper?
- To enable the hyper-personalization of ad banners, they need to be created automatically.
- The Myntra research team suggests the following pipeline for automatic generation of ad creatives such as homepage banners:
- large-scale automated annotation of all available images and tagging each of them with the relevant data;
- feeding the annotated data to the genetic-algorithm-based layout generation module and further to the creation module (a toy sketch of the layout search is given after this list);
- re-ranking the generated banners with a model built on historical data.
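To give a flavor of the genetic-algorithm layout step, here is a toy Python sketch that evolves positions for three banner components on a fixed canvas, scoring layouts by how little their bounding boxes overlap. The components, canvas size, and fitness function are illustrative assumptions; the real system additionally accounts for the image content, input components, and other design constraints described in the paper.

```python
import random

CANVAS_W, CANVAS_H = 1024, 400
COMPONENTS = [("image", 360, 360), ("headline", 420, 90), ("cta", 220, 70)]  # (name, width, height)

def random_layout():
    # One candidate layout: a top-left position for each component, kept inside the canvas.
    return [(random.randint(0, CANVAS_W - w), random.randint(0, CANVAS_H - h))
            for _, w, h in COMPONENTS]

def overlap(a, b):
    (ax, ay, aw, ah), (bx, by, bw, bh) = a, b
    return max(0, min(ax + aw, bx + bw) - max(ax, bx)) * max(0, min(ay + ah, by + bh) - max(ay, by))

def fitness(layout):
    boxes = [(x, y, w, h) for (x, y), (_, w, h) in zip(layout, COMPONENTS)]
    total_overlap = sum(overlap(boxes[i], boxes[j])
                        for i in range(len(boxes)) for j in range(i + 1, len(boxes)))
    return -total_overlap  # layouts with no overlapping components score highest

def evolve(generations=100, pop_size=50, mutation_rate=0.2):
    population = [random_layout() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]                       # keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = random.sample(parents, 2)
            child = [random.choice(pair) for pair in zip(p1, p2)]   # uniform crossover
            if random.random() < mutation_rate:                     # mutate one component position
                i = random.randrange(len(COMPONENTS))
                child[i] = random_layout()[i]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best_layout = evolve()
```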
Figure: End-to-end pipeline for automatic creation of banners
What’s the key achievement?
- Introducing a novel approach to automatic generation of ad creatives which, according to the experiments:
- results in a significant CTR increase (by 72%);
- includes a ranking model that evaluates generated banners in line with human judgment.
What does the AI community think?
- The paper was presented during the Workshop on Recommender Systems in Fashion within RecSys 2019, the 13th ACM Conference on Recommender Systems.
What are future research areas?
- Exploring the opportunity to further boost personalization by performing online ranking via reinforcement learning.
What are possible business applications?
- The introduced approach to the automatic generation of aesthetically appealing banners can be leveraged by e-commerce companies, social messaging platforms, and online video content providers.
Enjoy this article? Sign up for more AI for marketing research updates.
We’ll let you know when we release more summary articles like this one.