This research summary is part of our AI for Marketing series, which covers the latest AI & machine learning approaches to 5 aspects of marketing automation:
- Attribution
- Optimization
- Personalization
- Analytics
- Content Generation: Images, Videos, and Text
AI algorithms can significantly increase the efficiency of original content generation by offering entirely new approaches to creating videos for advertising campaigns. Do you want to create videos conditioned on the captions provided in a marketing brief? Do you want to create an animation when you have only the first and the last frame? Would you like to create a video of a talking model when you have only a few shots of that person?
AI research teams from all over the world are looking for new approaches to video generation that require less and less data and are therefore more applicable in real-world settings. For your convenience, we’ve summarized several recent approaches to video synthesis that should be of particular interest to marketers because they allow video generation from only a few still images.
If these accessible AI research analyses & summaries are useful for you, you can subscribe to receive our regular industry updates below.
If you’d like to skip around, here are the papers we featured:
- To Create What You Tell: Generating Videos from Captions
- Few-Shot Adversarial Learning of Realistic Neural Talking Head Models
- From Here to There: Video Inbetweening Using Direct 3D Convolutions
Important Video Generation Research Papers
1. To Create What You Tell: Generating Videos from Captions, by Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, Tao Mei
Original Abstract
We are creating multimedia contents every day and everywhere. While automatic content generation has played a fundamental challenge to multimedia community for decades, recent advances of deep learning have made this problem feasible. For example, the Generative Adversarial Networks (GANs) is a rewarding approach to synthesize images. Nevertheless, it is not trivial when capitalizing on GANs to generate videos. The difficulty originates from the intrinsic structure where a video is a sequence of visually coherent and semantically dependent frames. This motivates us to explore semantic and temporal coherence in designing GANs to generate videos. In this paper, we present a novel Temporal GANs conditioning on Captions, namely TGANs-C, in which the input to the generator network is a concatenation of a latent noise vector and caption embedding, and then is transformed into a frame sequence with 3D spatio-temporal convolutions. Unlike the naive discriminator which only judges pairs as fake or real, our discriminator additionally notes whether the video matches the correct caption. In particular, the discriminator network consists of three discriminators: video discriminator classifying realistic videos from generated ones and optimizes video-caption matching, frame discriminator discriminating between real and fake frames and aligning frames with the conditioning caption, and motion discriminator emphasizing the philosophy that the adjacent frames in the generated videos should be smoothly connected as in real ones. We qualitatively demonstrate the capability of our TGANs-C to generate plausible videos conditioning on the given captions on two synthetic datasets (SBMG and TBMG) and one real-world dataset (MSVD). Moreover, quantitative experiments on MSVD are performed to validate our proposal via Generative Adversarial Metric and human study.
Our Summary
The research team from Microsoft Research Asia explores video generation from captions. They suggest a novel architecture that is based on Generative Adversarial Networks (GANs) but with a number of adjustments that ensure semantic and temporal coherence of the generated videos. In particular, the discriminator network of the introduced Temporal GANs conditioning on Captions (TGANs-C) consists of three discriminators: (1) a video discriminator to optimize video-caption matching; (2) a frame discriminator to align frames with the conditioning caption; and (3) a motion discriminator to ensure temporal coherence between adjacent frames. The experiments confirm the effectiveness of the presented approach in generating coherent videos from captions.
Examples of video generation from captions
What’s the core idea of this paper?
- The researchers address the problem of generating temporally coherent videos that are semantically aligned with a given descriptive sentence.
- To ensure both temporal coherence and semantic match with a caption, the paper introduces Temporal GANs conditioning on Captions (TGANs-C):
- The caption is encoded into an embedding by a Long Short-Term Memory (LSTM) network, concatenated with a latent noise vector, and fed into the generator network.
- The generator network produces a sequence of video frames by utilizing 3D convolutions.
- The discriminator network consists of three discriminators:
- a video discriminator for distinguishing real videos from generated ones and optimizing video-caption matching;
- a frame discriminator for distinguishing between real and fake frames and aligning frames with the conditioning caption;
- a motion discriminator for judging the displacement between consecutive frames in real and generated videos, further enhancing temporal coherence.
- TGANs-C is trained end-to-end by optimizing three losses: video-level and frame-level matching-aware losses and a temporal coherence loss (a simplified code sketch of this setup follows the framework figure below).
Temporal GANs conditioning on Captions (TGANs-C) framework
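To make the setup above more concrete, here is a minimal PyTorch sketch of the TGANs-C data flow: an LSTM caption encoder, a 3D-deconvolution generator fed with the concatenated noise and caption vectors, and a video discriminator with separate realism and caption-matching heads. This is an illustration of the idea rather than the authors’ implementation; the layer sizes, vocabulary size, and the frame-difference trick for the motion discriminator are our own simplifying assumptions.

```python
# Minimal PyTorch sketch of the TGANs-C idea (illustrative, not the authors' code).
# Layer sizes, vocabulary size and the frame-difference trick for the motion
# discriminator are assumptions made for brevity.
import torch
import torch.nn as nn

class CaptionEncoder(nn.Module):
    """Encode a tokenized caption into a fixed-size embedding with an LSTM."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):                      # tokens: (B, T_words)
        _, (h, _) = self.lstm(self.embed(tokens))
        return h[-1]                                # (B, hidden_dim)

class VideoGenerator(nn.Module):
    """Map [noise ; caption embedding] to a short frame sequence with 3D deconvolutions."""
    def __init__(self, noise_dim=100, cap_dim=256, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(noise_dim + cap_dim, ch * 4, (2, 4, 4)),      # -> 2x4x4
            nn.BatchNorm3d(ch * 4), nn.ReLU(True),
            nn.ConvTranspose3d(ch * 4, ch * 2, 4, stride=2, padding=1),      # -> 4x8x8
            nn.BatchNorm3d(ch * 2), nn.ReLU(True),
            nn.ConvTranspose3d(ch * 2, ch, 4, stride=2, padding=1),          # -> 8x16x16
            nn.BatchNorm3d(ch), nn.ReLU(True),
            nn.ConvTranspose3d(ch, 3, 4, stride=2, padding=1),               # -> 16x32x32
            nn.Tanh(),
        )

    def forward(self, noise, caption_emb):
        z = torch.cat([noise, caption_emb], dim=1)[:, :, None, None, None]
        return self.net(z)                          # (B, 3, 16, 32, 32)

class VideoDiscriminator(nn.Module):
    """Judge whole videos (real vs. generated) and their match with the caption."""
    def __init__(self, cap_dim=256, ch=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv3d(ch, ch * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv3d(ch * 2, ch * 4, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        )
        self.real_head = nn.Linear(ch * 4 * 2 * 4 * 4, 1)
        self.match_head = nn.Linear(ch * 4 * 2 * 4 * 4 + cap_dim, 1)

    def forward(self, video, caption_emb):
        feat = self.conv(video).flatten(1)
        return self.real_head(feat), self.match_head(torch.cat([feat, caption_emb], 1))

# The frame discriminator works the same way on individual frames (2D convolutions),
# and the motion discriminator scores differences between adjacent frames, e.g.
# motion = video[:, :, 1:] - video[:, :, :-1], to encourage smooth transitions.

if __name__ == "__main__":
    enc, gen, dis = CaptionEncoder(), VideoGenerator(), VideoDiscriminator()
    cap = enc(torch.randint(0, 5000, (2, 12)))           # two dummy captions
    fake = gen(torch.randn(2, 100), cap)                  # (2, 3, 16, 32, 32)
    real_score, match_score = dis(fake, cap)
    print(fake.shape, real_score.shape, match_score.shape)
```

In the full model, the frame and motion discriminators are trained jointly with the video discriminator, and the three losses listed above are combined into a single adversarial objective.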
What’s the key achievement?
- The newly proposed TGANs-C architecture is one of the first attempts to generate videos conditioned on captions, offering elegant solutions to the semantic and temporal coherence challenges.
- Compared to several strong baselines, TGANs-C generates videos that are more realistic, relevant to a given caption, and temporally coherent.
What does the AI community think?
- The research was recognized as a Brave New Idea at the 25th ACM International Conference on Multimedia (ACM MM 2017).
What are future research areas?
- Synthesizing higher resolution videos.
- Generating videos conditioned on open-vocabulary captions.
- Extending the framework to the audio domain.
What are possible business applications?
- The results of this research might be the first step towards automatic generation of video advertising campaigns based on the captions specified in a marketing brief.
2. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models by Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, Victor Lempitsky
Original Abstract
Several recent works have shown how highly realistic human head images can be obtained by training convolutional neural networks to generate them. In order to create a personalized talking head model, these works require training on a large dataset of images of a single person. However, in many practical scenarios, such personalized talking head models need to be learned from a few image views of a person, potentially even a single image. Here, we present a system with such few-shot capability. It performs lengthy meta-learning on a large dataset of videos, and after that is able to frame few- and one-shot learning of neural talking head models of previously unseen people as adversarial training problems with high capacity generators and discriminators. Crucially, the system is able to initialize the parameters of both the generator and the discriminator in a person-specific way, so that training can be based on just a few images and done quickly, despite the need to tune tens of millions of parameters. We show that such an approach is able to learn highly realistic and personalized talking head models of new people and even portrait paintings.
Our Summary
The researchers from Samsung AI Center in Moscow propose a new solution for synthesizing personalized talking head sequences using only a few image views of a person. In contrast to existing methods, which require a huge dataset of images of a single person’s head to create a photorealistic video of that person, the suggested approach needs only up to eight images of the target person because it is pre-trained on a large dataset of talking head videos of many different speakers with diverse appearances. The experiments demonstrate that the system can generate a reasonable result even from a single photo, while adding a few more photos yields highly realistic video models.
What’s the core idea of this paper?
- Up to now, generating realistic videos of a single person’s talking head required training convolutional neural networks on several minutes of video or thousands of still images of that person.
- However, in practical scenarios, often only a few shots of a person are available. Thus, the authors suggest pre-training a network on a large dataset of videos and still images of many different speakers, and then simply fine-tuning the model with a few shots of the particular person.
- Video frames are synthesized with adversarially-trained deep convolutional networks (ConvNets) using landmark tracks extracted from video sequences of the same person or face landmark tracks of a different person.
- To learn how to transform landmark positions into realistic-looking personalized images from only a few shots, the system simulates few-shot learning tasks during meta-learning (a simplified code sketch of the fine-tuning stage follows this list).
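The sketch below illustrates, in simplified PyTorch, the few-shot fine-tuning flow described above: an embedder averages a person embedding over the K available shots (frames plus rasterized landmarks), and a landmark-driven generator is then fine-tuned on those same shots. This is a rough sketch under our own assumptions; the actual system injects the embedding through AdaIN layers and trains with perceptual and adversarial losses, all omitted here for brevity.

```python
# Simplified PyTorch sketch of the few-shot fine-tuning stage (not the authors' system).
# Data flow only: (frames + landmark rasters) -> person embedding -> landmark-driven generator.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(True))

class Embedder(nn.Module):
    """Average an embedding over the K available shots of the target person."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(conv_block(6, 64), conv_block(64, 128),
                                 conv_block(128, dim), nn.AdaptiveAvgPool2d(1))

    def forward(self, frames, landmarks):            # both: (B, K, 3, H, W)
        b, k = frames.shape[:2]
        x = torch.cat([frames, landmarks], dim=2).flatten(0, 1)   # (B*K, 6, H, W)
        e = self.net(x).flatten(1).view(b, k, -1)
        return e.mean(dim=1)                          # (B, dim)

class Generator(nn.Module):
    """Render a frame from a landmark raster, modulated by the person embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.enc = nn.Sequential(conv_block(3, 64), conv_block(64, 128))
        self.fc = nn.Linear(dim, 128)                 # crude stand-in for AdaIN conditioning
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, landmark, person_emb):
        h = self.enc(landmark) + self.fc(person_emb)[:, :, None, None]
        return self.dec(h)

# Few-shot fine-tuning on K = 8 shots of a new person (dummy tensors stand in for real data).
frames = torch.rand(1, 8, 3, 64, 64)                  # would normally be normalized to [-1, 1]
landmarks = torch.rand(1, 8, 3, 64, 64)
embedder, generator = Embedder(), Generator()
with torch.no_grad():                                 # keep the sketch simple: embedding stays fixed
    person_emb = embedder(frames, landmarks)
opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
for step in range(10):                                # a few quick fine-tuning steps
    i = torch.randint(0, 8, (1,)).item()              # pick one of the K shots
    pred = generator(landmarks[:, i], person_emb)
    loss = F.l1_loss(pred, frames[:, i])              # reconstruction term only
    opt.zero_grad(); loss.backward(); opt.step()
print(pred.shape)                                     # torch.Size([1, 3, 64, 64])
```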
What’s the key achievement?
- Developing a network with low computational cost that can generate talking head models from as few as one still image.
- Models trained on 32 still images achieve perfect realism in a user study where viewers were asked to distinguish generated talking head sequences from real videos of the same person.
What does the AI community think?
- The paper has made waves in AI-related media outlets thanks to how much it has improved on existing methods for generating talking head models.
- The paper has also been widely covered in the mainstream press because of its implications for making it less computationally difficult to create “deepfake” videos.
What are future research areas?
- Exploring ways to improve the mimics representation, including better modeling of the gaze.
- Improving landmark adaptation so that landmarks from one face could be leveraged to generate another face without noticeable personality mismatch.
What are possible business applications?
- The ability to quickly generate realistic talking head models of a particular person from only a few available shots can be successfully applied to synthesizing original content for marketing campaigns.
Where can you get implementation code?
- The authors haven’t released the official implementation code, but a third-party PyTorch implementation of the introduced approach is available on GitHub.
3. From Here to There: Video Inbetweening Using Direct 3D Convolutions by Yunpeng Li, Dominik Roblek, Marco Tagliasacchi
Original Abstract
We consider the problem of generating plausible and diverse video sequences, when we are only given a start and an end frame. This task is also known as inbetweening, and it belongs to the broader area of stochastic video generation, which is generally approached by means of recurrent neural networks (RNN). In this paper, we propose instead a fully convolutional model to generate video sequences directly in the pixel domain. We first obtain a latent video representation using a stochastic fusion mechanism that learns how to incorporate information from the start and end frames. Our model learns to produce such latent representation by progressively increasing the temporal resolution, and then decode in the spatiotemporal domain using 3D convolutions. The model is trained end-to-end by minimizing an adversarial loss. Experiments on several widely-used benchmark datasets show that it is able to generate meaningful and diverse in-between video sequences, according to both quantitative and qualitative evaluations.
Our Summary
The Google Research team demonstrates a new method for creating plausible video sequences given only a start and an end frame – a process known as video inbetweening. They address the problem with a fully convolutional model instead of the traditional recurrent networks. A key component of the proposed system is a 3D-convolutional latent representation generator, which fuses information from the input frames and builds up the latent video representation by progressively increasing its temporal resolution. Using this system, the researchers were able to synthesize 14 frames of stylistically consistent and meaningful video between the given start and end frames.
What’s the core idea of this paper?
- Addressing the task of video inbetweening with a fully convolutional model that has three key components:
- a 2D-convolutional image encoder to map the input frames to a latent space;
- a 3D-convolutional latent representation generator that fuses information from the input frames and builds a latent video representation with progressively increasing temporal resolution;
- a video generator to decode the latent representation into video frames.
- Separating the generation of the latent representation from video decoding – the experiments show that generating video directly from the encoded input frames performs poorly (a simplified code sketch of the three components follows this list).
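Below is a rough PyTorch sketch of the three components listed above: a 2D image encoder for the start and end frames, a 3D-convolutional latent generator that grows the temporal axis step by step, and a 3D-convolutional video decoder. The paper’s stochastic fusion mechanism is replaced here by simply stacking the two frame codes and appending noise channels, so treat this as an illustration of the idea rather than the paper’s implementation.

```python
# Rough PyTorch sketch of the inbetweening pipeline (sizes and the way noise is
# injected are simplifying assumptions; the paper's stochastic fusion is omitted).
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """2D-convolutional encoder: map a single 64x64 frame to a latent feature map."""
    def __init__(self, ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, ch, 4, stride=2, padding=1), nn.ReLU(True),       # 64 -> 32
            nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1), nn.ReLU(True),  # 32 -> 16
        )

    def forward(self, frame):                         # (B, 3, 64, 64)
        return self.net(frame)                        # (B, 128, 16, 16)

class LatentVideoGenerator(nn.Module):
    """Stack the start/end codes as a length-2 'video', append noise channels,
    and grow the temporal axis with strided 3D transposed convolutions."""
    def __init__(self, ch=128, noise_dim=16):
        super().__init__()
        def grow_layer(cin):
            return nn.ConvTranspose3d(cin, ch, (4, 3, 3), stride=(2, 1, 1), padding=1)
        self.grow = nn.Sequential(
            grow_layer(ch + noise_dim), nn.ReLU(True),   # T: 2 -> 4
            grow_layer(ch), nn.ReLU(True),               # T: 4 -> 8
            grow_layer(ch), nn.ReLU(True),               # T: 8 -> 16
        )

    def forward(self, z_start, z_end, noise):
        z = torch.stack([z_start, z_end], dim=2)                     # (B, 128, 2, 16, 16)
        n = noise[:, :, None, None, None].expand(-1, -1, 2, 16, 16)  # broadcast noise channels
        return self.grow(torch.cat([z, n], dim=1))                   # (B, 128, 16, 16, 16)

class VideoDecoder(nn.Module):
    """Decode the latent video back to pixel space with 3D transposed convolutions."""
    def __init__(self, ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(ch, 64, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.ReLU(True),
            nn.ConvTranspose3d(64, 3, (3, 4, 4), stride=(1, 2, 2), padding=1), nn.Tanh(),
        )

    def forward(self, latent):
        return self.net(latent)                        # (B, 3, 16, 64, 64)

if __name__ == "__main__":
    enc, gen, dec = ImageEncoder(), LatentVideoGenerator(), VideoDecoder()
    start, end = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
    video = dec(gen(enc(start), enc(end), torch.randn(1, 16)))
    print(video.shape)   # 16 frames including the two given endpoints: (1, 3, 16, 64, 64)
```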
What’s the key achievement?
- Quantitative and qualitative evaluations on several widely used datasets demonstrate that the suggested approach generates meaningful and diverse video sequences from only a start and an end frame.
- Moreover, the system produces video sequences twice as long (14 frames) as those of competing methods based on RNNs or optical flow.
What are future research areas?
- Further investigating video generation with convolutional models instead of traditional recurrent neural networks, as they show very promising results in this research.
What are possible business applications?
- The presented approach to video inbetweening can be successfully leveraged by marketers:
- to generate animations from a few still images;
- to enhance video editing, where the presented system can be used in place of interpolation for transitions and gapped sequences.
Enjoy this article? Sign up for more AI for marketing research updates.
We’ll let you know when we release more summary articles like this one.