Just two years ago, text generation models were so unreliable that you needed to generate hundreds of samples in hopes of finding even one plausible sentence.
Nowadays, OpenAI’s pre-trained language model can generate relatively coherent news articles given only two sentences of context. Other approaches like Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) have also become much better at text generation.
Here we have summarized for you 4 recently introduced research papers that bring text generation to the next level. Two of the papers rely on GAN architecture, one paper improves the VAE approach, and one paper relies on the pre-trained language model for text generation.
If these accessible AI research analyses & summaries are useful for you, you can subscribe to receive our regular industry updates below.
We have already covered pre-trained language models in a previous article, so let’s now briefly define GAN and VAE methods.
What is a GAN?
- A Generative Adversarial Network (GAN) is a framework for training generative models in an adversarial setup. It consists of two networks, a generator and a discriminator:
- a generator creates object instances (e.g., images, sentences) and tries to fool a discriminator;
- a discriminator is trained to discriminate between real and synthetic object instances.
- GANs are very successful at generating realistic images, but they have seen only limited use for text sequences. Because GANs were originally designed to output differentiable values, discrete language generation is challenging for them (see the sketch below).
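To make the adversarial setup concrete, here is a minimal, illustrative GAN training loop in PyTorch. The layer sizes, data dimensions, and optimizer settings are placeholders for illustration only, not taken from any of the papers below:

```python
import torch
import torch.nn as nn

# Toy generator and discriminator (placeholder sizes, for illustration only)
latent_dim, data_dim = 64, 784
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    batch_size = real_batch.size(0)

    # 1) Train the discriminator to tell real samples from generated ones
    z = torch.randn(batch_size, latent_dim)
    fake_batch = G(z).detach()
    d_loss = bce(D(real_batch), torch.ones(batch_size, 1)) + \
             bce(D(fake_batch), torch.zeros(batch_size, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator to fool the discriminator
    z = torch.randn(batch_size, latent_dim)
    g_loss = bce(D(G(z)), torch.ones(batch_size, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Note that this loop backpropagates through the generator’s continuous output, which works for images; for text, the generator must sample discrete words, which is why the GAN papers below resort to reinforcement learning tricks.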
What is a VAE?
- A standard autoencoder consists of two neural networks:
- an encoder composed of convolutional layers that encodes an object (image, text, sound) into a latent vector; and
- a decoder composed of deconvolutional layers that decodes a latent vector back into the object.
- The standard autoencoder network simply reconstructs the data but cannot generate new objects. To enable data generation, the variational autoencoder (VAE) requires an additional feature that allows it to learn the latent representations of the inputs as soft ellipsoidal regions rather than isolated data points.
- Thus, new data can be generated by sampling latent vectors from the latent space and passing them into the decoder, as sketched below.
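As a rough illustration, here is a minimal VAE sketch in PyTorch. The layer sizes are arbitrary placeholders; the key points are that the encoder outputs a mean and variance defining the “soft region” around each input, and that new data is produced by decoding a latent vector sampled from the prior:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=32):
        super().__init__()
        self.enc = nn.Linear(data_dim, 256)
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z from the soft region around x
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus a KL term that keeps the posterior close to the prior
    recon_loss = F.mse_loss(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

# To generate new data: sample z ~ N(0, I) and decode it, e.g.
# model = ToyVAE(); z = torch.randn(1, 32); new_sample = model.dec(z)
```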
If you’d like to skip around, here are the papers we featured:
- Long Text Generation via Adversarial Training with Leaked Information
- MaskGAN: Better Text Generation via Filling in the______
- Lagging Inference Networks and Posterior Collapse in Variational Autoencoders
- Language Models are Unsupervised Multitask Learners
New Machine Learning Approaches to Text Generation
1. Long Text Generation via Adversarial Training with Leaked Information, by Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, Jun Wang
Original Abstract
Automatically generating coherent and semantically meaningful text has many applications in machine translation, dialogue systems, image captioning, etc. Recently, by combining with policy gradient, Generative Adversarial Nets (GAN) that use a discriminative model to guide the training of the generative model as a reinforcement learning policy has shown promising results in text generation. However, the scalar guiding signal is only available after the entire text has been generated and lacks intermediate information about text structure during the generative process. As such, it limits its success when the length of the generated text samples is long (more than 20 words). In this paper, we propose a new framework, called LeakGAN, to address the problem for long text generation. We allow the discriminative net to leak its own high-level extracted features to the generative net to further help the guidance. The generator incorporates such informative signals into all generation steps through an additional Manager module, which takes the extracted features of current generated words and outputs a latent vector to guide the Worker module for next-word generation. Our extensive experiments on synthetic data and various real-world tasks with Turing test demonstrate that LeakGAN is highly effective in long text generation and also improves the performance in short text generation scenarios. More importantly, without any supervision, LeakGAN would be able to implicitly learn sentence structures only through the interaction between Manager and Worker.
Our Summary
The researchers suggest a new approach to modeling the text generation procedure: a model that combines adversarial training with policy gradients. To overcome the issue of sparse reward in long text generation, Guo et al. propose a hierarchical generator design with a Manager and a Worker. The Manager module receives the high-level features extracted by the discriminator from the current generated word sequence. Since normally no exchange of information is assumed between the generator and the discriminator, the researchers call this a leakage of information and dub the method LeakGAN. The experiments show that this approach brings significant improvements over previous models in terms of BLEU scores and human evaluations.
What’s the core idea of this paper?
- Introducing a new algorithmic framework called LeakGAN that borrows from recent advances in hierarchical reinforcement learning to provide richer information from the discriminator to the generator:
- the generator is an LSTM that contains a high-level MANAGER module and a low-level WORKER module;
- at each step, the MANAGER receives a high-level feature representation extracted by the discriminator and uses it to form the guiding goal for the WORKER module at that timestep (a simplified sketch follows this list).
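The following is a highly simplified sketch of how the Manager-Worker interaction could look in PyTorch, based on our reading of the paper. Module names, sizes, and the exact way the goal modulates the Worker’s output are placeholders, not the authors’ code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ManagerWorkerGenerator(nn.Module):
    """Simplified sketch of a LeakGAN-style hierarchical generator (illustrative only)."""
    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64, feature_dim=64, goal_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The Manager consumes the feature "leaked" by the discriminator
        self.manager = nn.LSTMCell(feature_dim, hidden_dim)
        self.to_goal = nn.Linear(hidden_dim, goal_dim)
        # The Worker consumes the previously generated word
        self.worker = nn.LSTMCell(emb_dim, hidden_dim)
        self.worker_out = nn.Linear(hidden_dim, vocab_size * goal_dim)
        self.vocab_size, self.goal_dim = vocab_size, goal_dim

    def step(self, prev_word, leaked_feature, m_state, w_state):
        # Manager turns the leaked discriminator feature into a goal vector
        m_h, m_c = self.manager(leaked_feature, m_state)
        goal = F.normalize(self.to_goal(m_h), dim=-1)
        # Worker proposes per-word scores, which are modulated by the Manager's goal
        w_h, w_c = self.worker(self.embed(prev_word), w_state)
        scores = self.worker_out(w_h).view(-1, self.vocab_size, self.goal_dim)
        logits = torch.einsum('bvg,bg->bv', scores, goal)
        return F.log_softmax(logits, dim=-1), (m_h, m_c), (w_h, w_c)
```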
What’s the key achievement?
- LeakGAN significantly outperforms the previous approaches in terms of BLEU scores and human evaluations and sets new state-of-the-art results on the EMNLP2017 WMT, COCO Captions, and Chinese Poems datasets.
What does the AI community think?
- The paper was presented at AAAI 2018, a highly selective conference on Artificial Intelligence.
What are future research areas?
- Applying LeakGAN to more NLP tasks, including dialogue systems and image captioning, by providing more task-specific guiding information.
- Enhancing the capacity of the discriminator to check the global consistency of the whole sentence.
What are possible business applications?
- LeakGAN can enhance text generation in such applications as machine translation, dialogue systems, image captioning, etc.
Where can you get implementation code?
- A TensorFlow implementation of the LeakGAN approach is available on GitHub.
2. MaskGAN: Better Text Generation via Filling in the______, by William Fedus, Ian Goodfellow, Andrew M. Dai
Original Abstract
Neural text generation models are often autoregressive language models or seq2seq models. These models generate text by sampling words sequentially, with each word conditioned on the previous word, and are state-of-the-art for several machine translation and summarization benchmarks. These benchmarks are often defined by validation perplexity even though this is not a direct measure of the quality of the generated text. Additionally, these models are typically trained via maximum likelihood and teacher forcing. These methods are well-suited to optimizing perplexity but can result in poor sample quality since generating text requires conditioning on sequences of words that may have never been observed at training time. We propose to improve sample quality using Generative Adversarial Networks (GANs), which explicitly train the generator to produce high-quality samples and have shown a lot of success in image generation. GANs were originally designed to output differentiable values, so discrete language generation is challenging for them. We claim that validation perplexity alone is not indicative of the quality of text generated by a model. We introduce an actor-critic conditional GAN that fills in missing text conditioned on the surrounding context. We show qualitatively and quantitatively, evidence that this produces more realistic conditional and unconditional text samples compared to a maximum likelihood trained model.
Our Summary
The Google Brain team argues that maximum-likelihood-trained models, which are widely used for text generation, can produce low-quality samples because text generation requires conditioning on sequences of words that were not necessarily seen during training. They introduce an alternative approach: an actor-critic conditional Generative Adversarial Network (GAN) that is trained to fill in missing text based on the surrounding context. The experiments show that this approach results in more realistic text samples.
What’s the core idea of this paper?
- The introduced model, called MaskGAN, is trained on the in-filling task:
- Portions of body text are deleted or redacted, and the goal of the model is to fill in the missing portions so that the result is indistinguishable from the original text.
- The task reduces to language modeling if the entire body of text is redacted.
- Text generation is discrete in nature, while GANs were originally designed to output differentiable values. To overcome this issue, the authors suggest:
- using reinforcement learning to train a generator;
- introducing a critic that helps the generator converge more rapidly by reducing the variance of the gradient estimates in the extremely large action space of word-level generation (see the sketch below).
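To illustrate the setup, here is a toy sketch of how a sequence might be redacted for the in-filling task and how a critic baseline reduces the variance of the policy-gradient update. The function names, reward shapes, and masking scheme are our own simplifications, not the paper’s code:

```python
import torch

def mask_span(token_ids, mask_id, span_len=4):
    """Redact a random contiguous span; the generator must fill it in so the
    result is indistinguishable from the original text."""
    masked = token_ids.clone()
    seq_len = token_ids.size(-1)
    start = torch.randint(0, max(seq_len - span_len, 1), (1,)).item()
    masked[..., start:start + span_len] = mask_id
    return masked, (start, start + span_len)

def policy_gradient_loss(log_probs, rewards, critic_values):
    """REINFORCE with a learned critic baseline (actor-critic) to reduce gradient variance.

    log_probs:     log-probabilities of the words sampled by the generator
    rewards:       per-word rewards from the discriminator (probability of being real)
    critic_values: the critic's estimate of the expected reward at each step
    """
    advantage = (rewards - critic_values).detach()
    return -(advantage * log_probs).mean()
```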
What’s the key achievement?
- Demonstrating that the contiguous in-filling task can be a good approach for reducing mode collapse and improving training stability for textual GANs.
- Showing that, according to human evaluations, MaskGAN generates significantly better samples than a maximum-likelihood-trained model.
What does the AI community think?
- The paper was presented at ICLR 2018, one of the key deep learning conferences.
What are future research areas?
- Considering GAN-training with attention-only models.
- Investigating whether training with discrete nodes may result in more stable training procedures.
Where can you get implementation code?
- The authors provide a TensorFlow implementation of this research paper.
3. Lagging Inference Networks and Posterior Collapse in Variational Autoencoders, by Junxian He, Daniel Spokoyny, Graham Neubig, Taylor Berg-Kirkpatrick
Original Abstract
The variational autoencoder (VAE) is a popular combination of deep latent variable model and accompanying variational learning technique. By using a neural inference network to approximate the model’s posterior on latent variables, VAEs efficiently parameterize a lower bound on marginal data likelihood that can be optimized directly via gradient methods. In practice, however, VAE training often results in a degenerate local optimum known as “posterior collapse” where the model learns to ignore the latent variable and the approximate posterior mimics the prior. In this paper, we investigate posterior collapse from the perspective of training dynamics. We find that during the initial stages of training the inference network fails to approximate the model’s true posterior, which is a moving target. As a result, the model is encouraged to ignore the latent encoding and posterior collapse occurs. Based on this observation, we propose an extremely simple modification to VAE training to reduce inference lag: depending on the model’s current mutual information between latent variable and observation, we aggressively optimize the inference network before performing each model update. Despite introducing neither new model components nor significant complexity over basic VAE, our approach is able to avoid the problem of collapse that has plagued a large amount of previous work. Empirically, our approach outperforms strong autoregressive baselines on text and image benchmarks in terms of held-out likelihood, and is competitive with more complex techniques for avoiding collapse while being substantially faster.
Our Summary
The researchers from Carnegie Mellon University propose a novel training procedure for Variational Autoencoders (VAEs) to address “posterior collapse”, a problem where the model ignores the latent variable and the approximate posterior mimics the prior. In the paper, they investigate the reasons behind the posterior collapse phenomenon and discover that, in the initial stages of training, the posterior approximation often lags far behind the true model posterior. Thus, they suggest mitigating this lagging behavior with a training procedure that aggressively optimizes the inference network with additional updates. The experiments show that this simple approach outperforms strong autoregressive baselines on several text benchmarks and performs comparably to more complicated methods that are much slower.
What’s the core idea of this paper?
- Introducing a novel training procedure to address the posterior collapse problem:
- Posterior collapse occurs when the model learns to ignore the latent variable. The phenomenon is more common with flexible generative models (e.g., LSTM decoders).
- The researchers discovered empirically that posterior collapse can occur because, in the initial stages of training, the posterior approximation lags far behind the true model posterior.
- To mitigate this lagging behavior, they suggest “aggressive updates”: early in training, joint gradient updates are replaced with an alternative scheme in which, before each decoder (model) update, the encoder is optimized with as many gradient steps as needed to reach convergence.
- In contrast to alternative solutions to the posterior collapse problem, this approach neither changes the objective function nor introduces additional model components, making it simple and fast (a schematic of the training loop is sketched below).
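The following is a schematic of the aggressive update scheme as we understand it. The `model.elbo_loss` method, the inner-step cap, and the convergence test are placeholders; the paper additionally uses the mutual information between latent variable and observation to decide when to switch back to standard training:

```python
def aggressive_vae_training(model, data_loader, enc_opt, dec_opt, use_aggressive=True):
    """Sketch of the 'aggressive' schedule: before each decoder update, the
    encoder (inference network) is optimized until its loss stops improving."""
    for batch in data_loader:
        if use_aggressive:
            # Inner loop: update only the encoder until (approximate) convergence
            best = float('inf')
            for _ in range(100):                      # cap on inner steps
                loss = model.elbo_loss(batch)         # hypothetical ELBO helper
                enc_opt.zero_grad(); loss.backward(); enc_opt.step()
                if loss.item() > best - 1e-4:         # crude convergence test
                    break
                best = loss.item()
        # Outer step: one regular update of the decoder (the generative model)
        loss = model.elbo_loss(batch)
        dec_opt.zero_grad(); loss.backward(); dec_opt.step()
        # In the paper, aggressive updates are turned off once the mutual
        # information between latent variable and observation stops increasing.
```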
What’s the key achievement?
- Getting state-of-the-art results on the Yahoo questions and Yelp benchmarks.
- Demonstrating fast training with the training time of the suggested approach being:
- only 2-3 times longer than for a regular VAE, and
- 3-7 times shorter than for alternative approaches to solving the posterior collapse problem (e.g., SA-VAE).
What does the AI community think?
- The paper was accepted by ICLR 2019, one of the key deep learning conferences.
- “I find the idea very interesting and promising. The proposed algorithm is very easy to be applied, thus, it could be easily reproduced.” – from an anonymous paper review.
What are future research areas?
- Exploring the effect of the suggested approach under different priors.
- Investigating the effect of the proposed procedure on other known issues with VAEs.
Where can you get implementation code?
- PyTorch implementation of this research paper is provided on GitHub.
4. Language Models are Unsupervised Multitask Learners, by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Original Abstract
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset – matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Our Summary
In this paper, the OpenAI team demonstrates that pre-trained language models can be used to solve downstream tasks without any parameter or architecture modifications. They trained a very large model, a 1.5B-parameter Transformer, on a large and diverse dataset containing text scraped from 45 million webpages. The model generates coherent paragraphs of text and achieves promising, competitive, or state-of-the-art results on a wide variety of tasks.
What’s the core idea of this paper?
- Training the language model on the large and diverse dataset:
- selecting webpages that have been curated/filtered by humans;
- cleaning and de-duplicating the texts, and removing all Wikipedia documents to minimize overlap between training and test sets;
- using the resulting WebText dataset with slightly over 8 million documents for a total of 40 GB of text.
- Using a byte-level version of Byte Pair Encoding (BPE) for input representation.
- Building a very large Transformer-based model, GPT-2:
- the largest model includes 1542M parameters and 48 layers;
- the model mainly follows the OpenAI GPT model, with a few modifications (i.e., expanding the vocabulary and context size, modifying initialization, etc.).
What’s the key achievement?
- Getting state-of-the-art results on 7 out of 8 tested language modeling datasets.
- Showing quite promising results in commonsense reasoning, question answering, reading comprehension, and translation.
- Generating coherent texts, for example, a news article about the discovery of talking unicorns.
What does the AI community think?
- “The researchers built an interesting dataset, applying now-standard tools and yielding an impressive model.” – Zachary C. Lipton, an assistant professor at Carnegie Mellon University.
What are future research areas?
- Investigating fine-tuning on benchmarks such as decaNLP and GLUE to see whether the huge dataset and capacity of GPT-2 can overcome the inefficiencies of BERT’s unidirectional representations.
What are possible business applications?
- In terms of practical applications, the performance of the GPT-2 model without any fine-tuning is far from usable, but it points to a very promising research direction.
Where can you get implementation code?
- OpenAI decided to release only a smaller version of GPT-2 with 117M parameters. The decision not to release larger models was taken “due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale”.
- Hugging Face has introduced a PyTorch implementation of the released GPT-2 model (see the usage sketch below).
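As an illustration, the released smaller model can be loaded and sampled with a few lines of code. The snippet below uses the current Hugging Face transformers API; package, class, and model identifiers may differ between library versions, so treat it as a sketch rather than the official recipe:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the publicly released GPT-2 weights and byte-level BPE tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation conditioned on the prompt
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, do_sample=True, top_k=40)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```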