It is no secret that today's algorithms can generate highly realistic deepfakes – images or videos that are entirely fake yet very hard to distinguish from real ones.
You can make Mark Zuckerberg talk about "one man with total control of billions of people's stolen data", and suspicion will arise only because Zuckerberg is unlikely to say these exact words, even though the video itself looks perfectly realistic.
So, let's look at some of the state-of-the-art approaches to video generation.
If you find these summaries of scientific AI research papers useful, you can subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- Video-to-Video Synthesis
- Everybody Dance Now
- Stochastic Video Generation with a Learned Prior
- MoCoGAN: Decomposing Motion and Content for Video Generation
Important Video Generation Research Papers
1. Video-to-Video Synthesis, by Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
Original Abstract
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial learning framework. Through carefully-designed generator and discriminator architectures, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K resolution videos of street scenes up to 30 seconds long, which significantly advances the state-of-the-art of video synthesis. Finally, we apply our approach to future video prediction, outperforming several state-of-the-art competing systems.
Our Summary
Researchers from NVIDIA have introduced a novel video-to-video synthesis approach. The framework is based on conditional GANs: it couples carefully designed generator and discriminator architectures with a spatio-temporal adversarial objective. The experiments demonstrate that the proposed vid2vid approach can synthesize high-resolution, photorealistic, temporally coherent videos from a diverse set of input formats, including segmentation masks, sketches, and poses. It can also predict future frames with results far superior to those of the baseline models.
What’s the core idea of this paper?
- Video frames can be generated sequentially, and the generation of each frame depends on only three factors:
- current source frame;
- past two source frames;
- past two generated frames.
- Using multiple discriminators can mitigate the mode collapse problem during GAN training:
- Conditional image discriminator ensures that each output frame resembles a real image given the same source image.
- Conditional video discriminator ensures that consecutive output frames resemble the temporal dynamics of a real video given the same optical flow.
- Foreground-background prior in the generator design further improves the synthesis performance of the proposed model.
- Using a soft occlusion mask instead of a binary one makes it easier to handle the “zoom-in” scenario: details can be added by gradually blending the warped pixels with the newly synthesized pixels (see the sketch below).
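As a rough illustration of that compositing step, here is a minimal sketch (under our own assumptions about function naming and tensor shapes, not the authors' code) of how a soft occlusion mask can blend a flow-warped previous frame with newly synthesized content:

```python
import torch

def compose_frame(warped_prev, synthesized, occlusion_mask):
    """Blend the optical-flow-warped previous frame with newly synthesized
    content using a soft (continuous-valued) occlusion mask rather than a
    hard binary one: values near 1 keep warped pixels, values near 0 favor
    newly hallucinated detail.
    Shapes: warped_prev, synthesized (B, C, H, W); occlusion_mask (B, 1, H, W)."""
    return occlusion_mask * warped_prev + (1.0 - occlusion_mask) * synthesized
```

A binary mask would force each pixel to come entirely from one source, whereas the soft mask lets newly synthesized detail fade in gradually as the camera zooms in.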
What’s the key achievement?
- Outperforming the strong baselines in video synthesis:
- Generating high-resolution (2048×1024), photorealistic, temporally coherent videos up to 30 seconds long.
- Producing several videos with different visual appearances by sampling different feature vectors.
- Outperforming the baseline models in future video prediction.
- Open-sourcing a PyTorch implementation of the technique. This code can be used for:
- Converting semantic labels into realistic real-world videos.
- Generating multiple outputs of talking people from edge maps.
- Generating an entire human body given a pose.
What does the AI community think?
- “NVIDIA’s new vid2vid is the first open-source code that lets you fake anybody’s face convincingly from one source video. […] interesting times ahead…”, Gene Kogan, an artist and a programmer.
- The paper has also received some criticism over the concern that it can be used to create deepfakes or tampered videos which can deceive people.
What are future research areas?
- Using object tracking information to make sure that each object has a consistent appearance across the whole video.
- Investigating whether training the model with coarser semantic labels helps reduce the visible artifacts that appear after semantic manipulations (e.g., turning trees into buildings).
- Adding additional 3D cues, such as depth maps, to enable synthesis of turning cars.
What are possible business applications?
- Marketing and advertising can benefit from the opportunities created by the vid2vid method (e.g., replacing the face or even the entire body in the video). However, this should be used with caution, keeping in mind the ethical considerations.
Where can you get implementation code?
- The NVIDIA team provides the original implementation of this research paper on GitHub.
2. Everybody Dance Now, by Caroline Chan, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros
Original Abstract
This paper presents a simple method for “do as I do” motion transfer: given a source video of a person dancing we can transfer that performance to a novel (amateur) target after only a few minutes of the target subject performing standard moves. We pose this problem as a per-frame image-to-image translation with spatio-temporal smoothing. Using pose detections as an intermediate representation between source and target, we learn a mapping from pose images to a target subject’s appearance. We adapt this setup for temporally coherent video generation including realistic face synthesis. Our video demo can be found at https://youtu.be/PCBTZh41Ris.
Our Summary
UC Berkeley researchers present a simple method for generating videos in which amateur dancers perform like professional dancers. If you want to take part in the experiment, all you need to do is record a few minutes of yourself performing some standard moves and then pick the video of the dance you want to replicate. The neural network does the main job: it treats the problem as per-frame image-to-image translation with spatio-temporal smoothing. By conditioning the prediction at each frame on that of the previous time step for temporal smoothness, and by applying a specialized GAN for realistic face synthesis, the method achieves genuinely impressive results.
What’s the core idea of this paper?
- “Do as I do” motion transfer is approached as a per-frame image-to-image translation with the pose stick figures as an intermediate representation between source and target:
- A pre-trained state-of-the-art pose detector creates pose stick figures from the source video.
- Global pose normalization is applied to account for differences between the source and target subjects in body shapes and locations within the frame.
- Normalized pose stick figures are mapped to the target subject.
- To make videos smooth, the researchers condition the generator on the previously generated frame and then give both images to the discriminator. Gaussian smoothing of the pose keypoints further reduces jitter (see the sketch after this list).
- To generate more realistic faces, the method includes an additional face-specific GAN that brushes up the face after the main generation is finished.
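The sketch below illustrates those two ingredients under stated assumptions: pose keypoints are smoothed over time with a Gaussian filter, and the generator is conditioned on both the current pose map and the previously generated frame. Function names, tensor layouts, and channel counts are hypothetical, not taken from the authors' implementation.

```python
import torch
from scipy.ndimage import gaussian_filter1d

def smooth_keypoints(keypoints, sigma=2.0):
    """Temporally smooth detected pose keypoints to reduce jitter.
    keypoints: numpy array of shape (num_frames, num_joints, 2)."""
    return gaussian_filter1d(keypoints, sigma=sigma, axis=0)

@torch.no_grad()
def generate_video(generator, pose_maps):
    """Translate rendered pose stick figures into frames one at a time,
    conditioning each prediction on the previously generated frame for
    temporal smoothness. pose_maps: (T, C, H, W) tensor; the generator is
    assumed to accept pose channels concatenated with a 3-channel RGB frame."""
    prev = pose_maps.new_zeros(1, 3, pose_maps.shape[2], pose_maps.shape[3])  # blank frame at t = 0
    frames = []
    for t in range(pose_maps.shape[0]):
        cond = torch.cat([pose_maps[t:t + 1], prev], dim=1)  # current pose map + previous output
        prev = generator(cond)                               # synthesize the next frame
        frames.append(prev)
    return torch.cat(frames, dim=0)
```

During training, the discriminator would see pairs of consecutive frames together with their pose maps so that temporal coherence is enforced adversarially; the sketch only covers generation.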
What’s the key achievement?
- Suggesting a novel approach to motion transfer that outperforms a strong baseline (pix2pixHD), according to both qualitative and quantitative assessments.
- Demonstrating that the face-specific GAN adds considerable detail to the output video.
What does the AI community think?
- “Overall I thought this was really fun and well executed. Looking forward to the code release so that I can start training my dance moves.”, Tom Brown, member of technical staff at Google Brain.
- “’Everybody Dance Now’ from Caroline Chan, Alyosha Efros and team transfers dance moves from one subject to another. The only way I’ll ever dance well. Amazing work!!!”, Soumith Chintala, AI Research Engineer at Facebook.
What are future research areas?
- Replacing pose stick figures with temporally coherent inputs and representation specifically optimized for motion transfer.
What are possible business applications?
- “Do as I do” motion transfer might be applied to replace subjects when creating marketing and promotional videos.
Where can you get implementation code?
- A PyTorch implementation of this research paper is available on GitHub.
3. Stochastic Video Generation with a Learned Prior, by Emily Denton, Rob Fergus
Original Abstract
Generating video frames that accurately predict future world states is challenging. Existing approaches either fail to capture the full distribution of outcomes, or yield blurry generations, or both. In this paper we introduce an unsupervised video generation model that learns a prior model of uncertainty in a given environment. Video frames are generated by drawing samples from this prior and combining them with a deterministic estimate of the future frame. The approach is simple and easily trained end-to-end on a variety of datasets. Sample generations are both varied and sharp, even many frames into the future, and compare favorably to those from existing approaches.
Our Summary
Extending video sequences is problematic when there is a stochastic event, such as a ball bouncing against the ground, that leads to many possible future frames. Until now, video generation methods for stochastic events have resulted in blurry frames as multiple possible futures are accommodated simultaneously. Denton and Fergus created a new stochastic video generation model that combines a deterministic frame predictor with time-dependent stochastic latent variables. It learns a prior model of uncertainty in a given environment, and then generates video frames by drawing samples from these priors and combining them with a deterministic estimate of the next frame. Using this model, the researchers were able to generate sharp and realistic video sequences many frames into the future.
What’s the core idea of this paper?
- This research addresses the problem of generating video sequences when it’s challenging to predict future world states because of a stochastic event.
- To this end, the authors introduce an unsupervised video generation model that combines a deterministic prediction of the next frame with stochastic latent variables, drawn from time-dependent priors.
- The majority of frames up to the stochastic event can be treated as deterministic; it is only at the point of a stochastic event that modeling uncertainty becomes important (see the sketch after this list).
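A minimal generation-time sketch of this idea follows; the module names (encoder, decoder, frame_predictor, prior) and their interfaces are assumptions made for illustration rather than the authors' exact API:

```python
import torch

@torch.no_grad()
def rollout(last_frame, encoder, decoder, frame_predictor, prior, horizon):
    """Roll out future frames: at each step, sample a latent z_t from the
    learned (recurrent, hence time-dependent) prior and combine it with a
    deterministic prediction of the next frame."""
    x = last_frame                      # last observed frame, shape (B, C, H, W)
    outputs = []
    for _ in range(horizon):
        h = encoder(x)                                          # frame encoding
        mu, logvar = prior(h)                                   # prior is assumed stateful (e.g., an LSTM)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # sample z_t ~ p(z_t | x_1:t)
        h_next = frame_predictor(torch.cat([h, z], dim=-1))     # deterministic frame predictor
        x = decoder(h_next)                                     # decode the next frame
        outputs.append(x)
    return outputs
```

During training, the latent variables are instead sampled from an approximate posterior, and a KL term keeps the learned prior close to it; the sketch above only covers generation.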
What’s the key achievement?
- The introduced model can produce sharp and realistic frames far into the future after a dynamic event (up to 100 frames).
- Compared to a similar model without a learned prior, this model produces sharper frames over longer timescales and better captures the distribution of possible outcomes.
- End-to-end training of the model is simpler than for existing stochastic video generation approaches.
What does the AI community think?
- The paper was accepted to the 35th International Conference on Machine Learning (ICML 2018).
What are future research areas?
- Applying this latent-prior stochastic video generation model to more complex video sequences with multiple stochastic events.
What are possible business applications?
- Automated extension of potentially complex video and animation sequences.
Where can you get implementation code?
- Implementation code is available on GitHub.
4. MoCoGAN: Decomposing Motion and Content for Video Generation, by Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz
Original Abstract
Visual signals in a video can be divided into content and motion. While content specifies which objects are in the video, motion describes their dynamics. Based on this prior, we propose the Motion and Content decomposed Generative Adversarial Network (MoCoGAN) framework for video generation. The proposed framework generates a video by mapping a sequence of random vectors to a sequence of video frames. Each random vector consists of a content part and a motion part. While the content part is kept fixed, the motion part is realized as a stochastic process. To learn motion and content decomposition in an unsupervised manner, we introduce a novel adversarial learning scheme utilizing both image and video discriminators. Extensive experimental results on several challenging datasets with qualitative and quantitative comparison to the state-of-the-art approaches, verify effectiveness of the proposed framework. In addition, we show that MoCoGAN allows one to generate videos with same content but different motion as well as videos with different content and same motion.
Our Summary
The team from Snap Research and NVIDIA has developed a framework for video generation that learns to distinguish between the objects in a video and their movements, and can alter these independently of each other: given a video, it can generate a video of the same object performing different movements, or a different object performing the same movements. The new framework, MoCoGAN, does this with a Generative Adversarial Network (GAN) that generates a random latent vector for every frame of a video, each consisting of a content part and a motion part. The content part is modeled with a Gaussian distribution, because the content usually stays the same from one frame to the next, while the motion part is modeled with a recurrent neural network (RNN). The experiments in the paper show that MoCoGAN outperforms state-of-the-art frameworks at video generation and next-frame prediction.
What’s the core idea of this paper?
- MoCoGAN’s novel contribution is the separation of content and motion.
- Like other frameworks, it uses a GAN that maps randomly sampled latent vectors to the desired output through adversarial training.
- Modeling content and motion separately allows it to learn to distinguish them without supervision, which means it can vary or predict each one independently of the other (see the sketch after this list).
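Below is a minimal sketch of how such a decomposed latent code can be assembled; the dimensions, the GRU cell, and the class interface are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MotionContentLatent(nn.Module):
    """Build per-frame latent codes in the MoCoGAN spirit: one content code
    per clip, plus a motion code per frame produced by a recurrent network
    driven by fresh noise at every step."""
    def __init__(self, content_dim=50, motion_dim=10, noise_dim=10):
        super().__init__()
        self.rnn = nn.GRUCell(noise_dim, motion_dim)
        self.content_dim, self.motion_dim, self.noise_dim = content_dim, motion_dim, noise_dim

    def forward(self, batch, num_frames):
        z_content = torch.randn(batch, self.content_dim)          # fixed for the whole clip
        h = torch.zeros(batch, self.motion_dim)
        codes = []
        for _ in range(num_frames):
            h = self.rnn(torch.randn(batch, self.noise_dim), h)   # motion evolves over time
            codes.append(torch.cat([z_content, h], dim=1))        # [z_content, z_motion_t]
        return torch.stack(codes, dim=1)  # (batch, num_frames, content_dim + motion_dim)
```

Each per-frame code would then be decoded by an image generator, with an image discriminator judging individual frames and a video discriminator judging the clip as a whole.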
What’s the key achievement?
- The experiments in the paper show that MoCoGAN outperforms state-of-the-art frameworks at video generation and next-frame prediction.
- In facial expression generation, MoCoGAN performs 40% better than VGAN and 34% better than TGAN, according to the Average Content Distance (ACD) metric.
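For intuition, here is a rough sketch of a content-consistency metric in the spirit of ACD, computed from per-frame embeddings; the feature extractor and the exact averaging scheme used in the paper may differ:

```python
import numpy as np

def average_content_distance(frame_features):
    """Average pairwise L2 distance between per-frame content embeddings of a
    generated video; lower values mean the depicted subject stays more
    consistent across frames. frame_features: (num_frames, feature_dim)."""
    diffs = frame_features[:, None, :] - frame_features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # (num_frames, num_frames) distance matrix
    n = len(frame_features)
    return float(dists.sum() / (n * (n - 1)))       # mean over off-diagonal pairs
```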
What does the AI community think?
- The paper was presented at CVPR 2018, the leading conference in computer vision.
What are possible business applications?
- With MoCoGAN, it’s possible to generate lots of variants on a video, which could then be A/B tested for viral marketing.
- MoCoGAN can be used to change the facial expression of a person in a video, or to replace a person with a different person with the same facial expression.
- A video featuring one product could be automatically edited to showcase a different product (different content) in the same setting (same motion) to reduce video production costs.
- Videos could be automatically edited to remove elements which conflict with the desired brand image and replace them with better alternatives.
Where can you get implementation code?
- The authors have released the code on GitHub.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.