UPDATE: We’ve also summarized the top 2020 NLP research papers.
In 2018 we saw a number of landmark research breakthroughs in the field of natural language processing (NLP). The introduction of transfer learning and pretrained language models pushed forward the limits of language understanding and generation, and these approaches also dominated NLP progress this year.
Teams from top research institutions and tech companies explored ways to make state-of-the-art language models even more sophisticated. Many improvements were driven by massive boosts in computing capacity, but many research groups also discovered ingenious ways to lighten models while maintaining high performance.
In this article, we summarize 11 research papers covering key language models presented during the year as well as recent research breakthroughs in machine translation, sentiment analysis, dialogue systems, and abstractive summarization. Of course, there are many more papers worth your attention, but we hope to provide you with a solid foundational understanding of the NLP research presented in 2019.
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- Language Models Are Unsupervised Multitask Learners
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
- Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts
- Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
- Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
- Probing the Need for Visual Context in Multimodal Machine Translation
- Bridging the Gap between Training and Inference for Neural Machine Translation
- On Extractive and Abstractive Neural Document Summarization with Transformer Language Models
- CTRL: A Conditional Transformer Language Model For Controllable Generation
- ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
11 Important NLP Research Papers of 2019
1. Language Models Are Unsupervised Multitask Learners, by Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Original Abstract
Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset – matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
Our Summary
In this paper, the OpenAI team demonstrates that pre-trained language models can be used to solve downstream tasks without any parameter or architecture modifications. The team trained a very large model, a 1.5B-parameter Transformer, on a large and diverse dataset of text scraped from 45 million webpages. The model generates coherent paragraphs of text and achieves promising, competitive, or state-of-the-art results on a wide variety of tasks.
What’s the core idea of this paper?
- Training the language model on the large and diverse dataset:
- selecting webpages that have been curated/filtered by humans;
- cleaning and de-duplicating the texts, and removing all Wikipedia documents to minimize overlapping of training and test sets;
- using the resulting WebText dataset with slightly over 8 million documents for a total of 40 GB of text.
- Using a byte-level version of Byte Pair Encoding (BPE) for input representation.
- Building a very big Transformer-based model, GPT-2:
- the largest model includes 1542M parameters and 48 layers;
- the model largely follows the original OpenAI GPT architecture, with a few modifications (e.g., an expanded vocabulary and context size, a modified initialization, etc.).
What’s the key achievement?
- Getting state-of-the-art results on 7 out of 8 tested language modeling datasets.
- Showing quite promising results in commonsense reasoning, question answering, reading comprehension, and translation.
- Generating coherent texts, for example, a news article about the discovery of talking unicorns.
What does the AI community think?
- “The researchers built an interesting dataset, applying now-standard tools and yielding an impressive model.” – Zachary C. Lipton, an assistant professor at Carnegie Mellon University.
What are future research areas?
- Investigating fine-tuning on benchmarks such as decaNLP and GLUE to see whether the huge dataset and capacity of GPT-2 can overcome the inefficiencies of unidirectional representations demonstrated by BERT.
What are possible business applications?
- In terms of practical applications, the performance of the GPT-2 model without any fine-tuning is far from usable but it shows a very promising research direction.
Where can you get implementation code?
- Initially, OpenAI decided to release only a smaller version of GPT-2 with 117M parameters. The decision not to release larger models was taken “due to concerns about large language models being used to generate deceptive, biased, or abusive language at scale”.
- In November, OpenAI finally released its largest 1.5B-parameter model. The code is available here.
- Hugging Face has introduced a PyTorch implementation of the initially released GPT-2 model.
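If you want to try zero-shot generation yourself, here is a minimal sketch using the Hugging Face transformers library and the smaller released checkpoint. It is our illustration rather than code from the paper, and the prompt and sampling settings are arbitrary.

```python
# Minimal sketch: zero-shot text generation with the smaller released GPT-2,
# via the Hugging Face transformers library (not code from the paper).
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # 117M-parameter version
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered a herd of unicorns"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sampling settings are illustrative; adjust to taste.
output_ids = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,
    top_k=40,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```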
2. XLNet: Generalized Autoregressive Pretraining for Language Understanding, by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Original Abstract
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
Our Summary
The researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis, and others. XLNet is a generalized autoregressive pretraining method that leverages the best of both autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT) while avoiding their limitations. The experiments demonstrate that the new model outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on 18 NLP tasks.
What’s the core idea of this paper?
- XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL:
- Like BERT, XLNet uses a bidirectional context, which means it looks at the words before and after a given token to predict what it should be. To this end, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order.
- As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus avoids BERT’s limitations due to masking – i.e., pretrain-finetune discrepancy and the assumption that unmasked tokens are independent of each other.
- To further improve architectural designs for pretraining, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL.
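To make the permutation-based objective described above more concrete, here is a toy sketch (our illustration, not the authors' code): one factorization order is sampled, and each token is predicted from the tokens that precede it in that order, so across permutations every position eventually sees both left and right context. The log_prob callable is an assumed stand-in for the model.

```python
import math
import random

def permutation_lm_loss(tokens, log_prob):
    """Toy illustration of the XLNet objective: sample one factorization order
    and sum log-probabilities of each token given the tokens that come before
    it *in that order* (not in left-to-right order).

    `log_prob(target, context)` is an assumed callable standing in for the
    model; it is not a real XLNet component.
    """
    order = list(range(len(tokens)))
    random.shuffle(order)                           # one sampled permutation z
    total = 0.0
    for t, pos in enumerate(order):
        context = [tokens[p] for p in order[:t]]    # x_{z_<t}
        total += log_prob(tokens[pos], context)     # log p(x_{z_t} | x_{z_<t})
    return -total

# Usage with a dummy uniform "model" over a 10-word vocabulary:
loss = permutation_lm_loss(["the", "cat", "sat"], lambda tgt, ctx: math.log(1 / 10))
print(loss)
```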
What’s the key achievement?
- XLNet outperforms BERT on 20 tasks, often by a large margin.
- The new model achieves state-of-the-art performance on 18 NLP tasks including question answering, natural language inference, sentiment analysis, and document ranking.
What does the AI community think?
- The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in artificial intelligence.
- “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian Ruder, a research scientist at DeepMind.
- “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.
What are future research areas?
- Extending XLNet to new areas, such as computer vision and reinforcement learning.
What are possible business applications?
- XLNet may assist businesses with a wide range of NLP problems, including:
- chatbots for first-line customer support or answering product inquiries;
- sentiment analysis for gauging brand awareness and perception based on customer reviews and social media;
- the search for relevant information in document bases or online, etc.
Where can you get implementation code?
- The authors have released the official Tensorflow implementation of XLNet.
- PyTorch implementation of the model is also available on GitHub.
3. RoBERTa: A Robustly Optimized BERT Pretraining Approach, by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
Original Abstract
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Our Summary
Natural language processing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and careful hyperparameter tuning difficult. In this study, Facebook AI and the University of Washington researchers analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance. Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sentence prediction training objective. The resulting optimized model, RoBERTa (Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.
What’s the core idea of this paper?
- The Facebook AI research team found that BERT was significantly undertrained and suggested an improved recipe for its training, called RoBERTa:
- More data: 160GB of text instead of the 16GB dataset originally used to train BERT.
- Longer training: increasing the number of iterations from 100K to 300K and then further to 500K.
- Larger batches: 8K instead of 256 in the original BERT base model.
- Larger byte-level BPE vocabulary with 50K subword units instead of character-level BPE vocabulary of size 30K.
- Removing the next sentence prediction objective from the training procedure.
- Dynamically changing the masking pattern applied to the training data.
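As a concrete illustration of the last point, here is a minimal sketch of dynamic masking (our simplification; BERT's actual procedure also sometimes keeps or randomly replaces the selected tokens): mask positions are re-sampled every time a sequence is fed to the model instead of being fixed once during preprocessing.

```python
import random

MASK, MASK_PROB = "[MASK]", 0.15

def dynamically_mask(tokens):
    """Re-sample masked positions on every call (dynamic masking).
    Static masking would instead fix the masked positions once, during
    data preprocessing, and reuse the same pattern for every epoch."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < MASK_PROB:
            masked.append(MASK)
            labels.append(tok)      # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)     # not a prediction target
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
for epoch in range(3):              # a different mask pattern each epoch
    print(dynamically_mask(tokens)[0])
```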
What’s the key achievement?
- RoBERTa outperforms BERT in all individual tasks on the General Language Understanding Evaluation (GLUE) benchmark.
- The new model matches the recently introduced XLNet model on the GLUE benchmark and sets a new state of the art in four out of nine individual tasks.
What are future research areas?
- Incorporating more sophisticated multi-task finetuning procedures.
What are possible business applications?
- Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including dialogue systems, question answering, document classification, etc.
Where can you get implementation code?
- The models and code used in this study are available on GitHub.
4. Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts, by Rui Xia and Zixiang Ding
Original Abstract
Emotion cause extraction (ECE), the task aimed at extracting the potential causes behind certain emotions in text, has gained much attention in recent years due to its wide applications. However, it suffers from two shortcomings: 1) the emotion must be annotated before cause extraction in ECE, which greatly limits its applications in real-world scenarios; 2) the way to first annotate emotion and then extract the cause ignores the fact that they are mutually indicative. In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potential pairs of emotions and corresponding causes in a document. We propose a 2-step approach to address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conduct emotion-cause pairing and filtering. The experimental results on a benchmark emotion cause corpus prove the feasibility of the ECPE task as well as the effectiveness of our approach.
Our Summary
Emotion cause extraction (ECE) is an approach used in natural language processing to identify statements containing the causes behind vocabulary expressing emotion. However, ECE requires emotions to first be annotated and ignores mutual relationships between causes and emotional effects. The researchers sought to solve this problem by simultaneously identifying pairs of emotions and causes in a task they call emotion-cause pair extraction (ECPE). ECPE uses a two-step approach: the first step uses two multi-task learning networks to identify emotion and cause clauses, while the second step pairs all causes and emotions and uses a trained filter to eliminate pairings that do not contain a causal relationship. The resulting ECPE approach identifies emotion-cause pairs with accuracy on par with existing ECE methods, but without requiring emotion annotation.
What’s the core idea of this paper?
- The paper introduces a new emotion-cause pair extraction (ECPE) task to overcome the limitations of the traditional ECE task, where emotion annotation is required prior to cause extraction and mutual indicativeness of emotion and cause is not taken into account.
- The introduced approach consists of two steps:
- In the first step, the two individual tasks of emotion extraction and cause extraction are performed via two kinds of multi-task learning networks:
- Inter-EC that uses emotion extraction to improve cause extraction;
- Inter-CE that leverages cause extraction to enhance emotion extraction.
- In the second step, the model combines all elements of the two sets into pairs by applying a Cartesian product. Then, a logistic regression model is trained to eliminate pairs that do not contain a causal relationship.
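Here is a minimal sketch of the second step described above (our illustration; keep_pair is an assumed callable standing in for the trained logistic-regression filter):

```python
from itertools import product

def extract_pairs(emotion_clauses, cause_clauses, keep_pair):
    """Step 2 of ECPE: pair every candidate emotion clause with every
    candidate cause clause (Cartesian product), then keep only the pairs
    that the trained filter judges to be causally related.

    `keep_pair(e, c)` stands in for the logistic-regression filter from the
    paper; here it is an assumed callable.
    """
    candidates = product(emotion_clauses, cause_clauses)
    return [(e, c) for e, c in candidates if keep_pair(e, c)]

# Toy usage: clause indices from step 1 and a dummy filter.
emotions = [2]           # clause 2 expresses an emotion
causes = [1, 4]          # clauses 1 and 4 look like causes
pairs = extract_pairs(emotions, causes, keep_pair=lambda e, c: c < e)
print(pairs)             # [(2, 1)] with this toy filter
```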
What’s the key achievement?
- ECPE is able to achieve F1 scores of 0.83 for emotion extraction, 0.65 for cause extraction, and 0.61 for emotion-cause pairing.
- On the ECE benchmark dataset, ECPE performs on par with existing ECE methods that require emotion annotation before causal clauses can be identified.
What does the AI community think?
- The paper received an Outstanding Paper award at ACL 2019.
What are future research areas?
- Altering the ECPE approach from a two-step to a one-step process that directly extracts emotion-cause pairs in an end-to-end fashion.
What are possible business applications?
- Sentiment analysis for marketing campaigns.
- Opinion monitoring from social media.
Where can you get implementation code?
- The code used in this study is available on GitHub.
5. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems, by Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, Pascale Fung
Original Abstract
Over-dependence on domain ontology and lack of knowledge sharing across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short in tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using a copy mechanism, facilitating knowledge transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art joint goal accuracy of 48.62% for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show its transferring ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains.
Our Summary
The research team from the Hong Kong University of Science and Technology and Salesforce Research addresses the problem of over-dependence on domain ontology and lack of knowledge sharing across domains. In a practical scenario, many slots share all or some of their values among different domains (e.g., the area slot can exist in many domains like restaurant, hotel, or taxi), and thus transferring knowledge across multiple domains is imperative for dialogue state tracking (DST) models. The researchers introduce a TRAnsferable Dialogue statE generator (TRADE) that leverages its context-enhanced slot gate and copy mechanism to track slot values mentioned anywhere in a dialogue history. TRADE shares its parameters across domains and doesn’t require a predefined ontology, which enables tracking of previously unseen slot values. The experiments demonstrate the effectiveness of this approach with TRADE achieving state-of-the-art joint goal accuracy of 48.62% on a challenging MultiWOZ dataset.
What’s the core idea of this paper?
- To overcome over-dependence on domain ontology and lack of knowledge sharing across domains, the researchers suggest:
- generating slot values directly instead of predicting the probability of every predefined ontology term;
- sharing all the model parameters across domains.
- The TRADE model consists of three components:
- an utterance encoder to encode dialogue utterances into a sequence of fixed-length vectors;
- a slot gate to predict whether a certain (domain, slot) pair is triggered by the dialogue;
- a state generator to decode multiple output tokens for all (domain, slot) pairs independently to predict their corresponding values.
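The state generator relies on a soft copy mechanism so that slot values never seen during training can still be produced by copying them from the dialogue history. The sketch below is our illustration of that general idea, not the authors' implementation; the tensor sizes and gate value are arbitrary.

```python
import torch

def copy_augmented_distribution(p_vocab, p_history, p_gen, history_token_ids, vocab_size):
    """Toy sketch of a soft copy mechanism like the one in TRADE's state
    generator (our illustration): blend the decoder's vocabulary distribution
    with an attention distribution over dialogue-history tokens, so slot
    values absent from training data can still be generated by copying."""
    generated = p_gen * p_vocab                          # generate from the vocabulary
    copied = torch.zeros(vocab_size)
    copied.index_add_(0, history_token_ids, p_history)   # scatter attention mass onto vocab ids
    return generated + (1.0 - p_gen) * copied            # copy from the dialogue history

# Toy usage with a 6-word vocabulary and a 3-token dialogue history.
p_vocab = torch.tensor([0.1, 0.2, 0.3, 0.1, 0.2, 0.1])
p_history = torch.tensor([0.5, 0.3, 0.2])                # attention over history tokens
history_ids = torch.tensor([4, 2, 4])                    # vocabulary ids of those tokens
print(copy_augmented_distribution(p_vocab, p_history, 0.7, history_ids, 6))
```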
What’s the key achievement?
- On a challenging MultiWOZ dataset of human-human dialogues, TRADE achieves joint goal accuracy of 48.62%, setting a new state of the art.
- Moreover, TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, demonstrating its ability to transfer knowledge to previously unseen domains.
- The experiments also demonstrate the model’s ability to adapt to new few-shot domains without forgetting already trained domains.
What does the AI community think?
- The paper received an Outstanding Paper award at the main ACL 2019 conference and the Best Paper Award at NLP for Conversational AI Workshop at the same conference.
What are future research areas?
- Transferring knowledge from other resources to further improve zero-shot performance.
- Collecting a dataset with a large number of domains to facilitate the study of techniques within multi-domain dialogue state tracking.
What are possible business applications?
- The current research can significantly improve the performance of task-oriented dialogue systems in multi-domain settings.
Where can you get implementation code?
- The PyTorch implementation of this study is available on GitHub.
6. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks, by Yikang Shen, Shawn Tan, Alessandro Sordoni, Aaron Courville
Original Abstract
Natural language is hierarchically structured: smaller units (e.g., phrases) are nested within larger units (e.g., clauses). When a larger constituent ends, all of the smaller constituents that are nested within it must also be closed. While the standard LSTM architecture allows different neurons to track information at different time scales, it does not have an explicit bias towards modeling a hierarchy of constituents. This paper proposes to add such an inductive bias by ordering the neurons; a vector of master input and forget gates ensures that when a given neuron is updated, all the neurons that follow it in the ordering are also updated. Our novel recurrent architecture, ordered neurons LSTM (ON-LSTM), achieves good performance on four different tasks: language modeling, unsupervised parsing, targeted syntactic evaluation, and logical inference.
Our Summary
The joint group of researchers from the Université de Montréal and Microsoft studies the problem of integrating tree structures into recurrent neural networks (RNNs). Natural language is hierarchically structured, with smaller units (e.g. phrases) nested within larger units (e.g. clauses), but this hierarchy is not reflected in a standard RNN architecture. In this paper, the authors propose to address this problem by ordering neurons. In particular, they use a new activation function, the cumulative softmax, or cumax(), to produce a vector of master input and forget gates: when a given neuron is updated (or erased), all of the neurons that follow it in the ordering are also updated (or erased). The experiments demonstrate that this new architecture, called ordered-neurons LSTM (ON-LSTM), performs well on a variety of NLP tasks.
What’s the core idea of this paper?
- Even though some evidence exists that LSTMs can potentially encode the tree structure implicitly, the researchers believe that better results can be obtained by equipping the model with an inductive bias towards learning such latent tree structures.
- To this end, they introduce a new inductive bias for RNNs, namely ordered neurons, that forces neurons to represent information at different time scales:
- High-ranking neurons will store long-term information to be kept for many steps, while low-ranking neurons will store short-term information that lasts only one or a few time steps.
- The differentiation between high-ranking and low-ranking neurons is learned in a data-driven fashion by controlling the update frequency of single neurons.
- In particular, a new activation function, the cumulative softmax, or cumax(), ensures that some neurons are updated more (or less) frequently than others: to erase (or update) high-ranking neurons, the model must first erase (or update) all lower-ranking neurons.
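The cumax() activation itself is easy to write down. Below is a minimal sketch (our illustration): the cumulative sum of a softmax yields a monotonically increasing vector in [0, 1] that acts as a soft version of a binary gate of the form (0, …, 0, 1, …, 1).

```python
import torch
import torch.nn.functional as F

def cumax(logits, dim=-1):
    """Cumulative softmax: cumsum(softmax(x)). The result rises monotonically
    from near 0 to 1, approximating a gate vector of the form
    (0, ..., 0, 1, ..., 1) that marks where the 'master' gate switches on."""
    return torch.cumsum(F.softmax(logits, dim=dim), dim=dim)

x = torch.tensor([0.1, 2.0, -1.0, 0.5])
print(cumax(x))   # monotonically non-decreasing, ends at 1.0
```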
What’s the key achievement?
- The proposed model performs better than standard LSTMs on such tasks as:
- language modeling;
- targeted syntactic evaluation (on long-term dependency cases);
- logical inference (especially on longer sequences).
- The ON-LSTM model also performs well on an unsupervised constituency parsing task:
- it outperforms previous models in terms of generalization and robustness toward longer sentences;
- it gives strong results for phrase detection, including adjective, prepositional, and noun phrases;
- the model induces the latent structure of natural language in coherence with human annotations.
What does the AI community think?
- The paper received the Best Paper award at ICLR 2019, the key conference in machine learning.
What are possible business applications?
- The proposed model can benefit many downstream NLP tasks, including question answering, named entity recognition, co-reference resolution, and others.
Where can you get implementation code?
- The code used for the word-level language model and unsupervised parsing experiments in this paper is available on GitHub.
7. Probing the Need for Visual Context in Multimodal Machine Translation, by Ozan Caglayan, Pranava Madhyastha, Lucia Specia, Loïc Barrault
Original Abstract
Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models from source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.
Our Summary
This paper addresses the importance of visual context in multimodal machine translation (MMT). Previous work demonstrated that visual modality is either totally unnecessary or just marginally beneficial for MMT models. The authors of this paper argue that such findings were obtained only because the Multi30K dataset used for evaluating the models contains very short and simple sentences, which makes the textual context sufficient for good translation. To test their hypothesis, the researchers introduce several input-degrading regimes and show that in the scenarios with scarce linguistic context, the models successfully exploit visual context.
What’s the core idea of this paper?
- The research team hypothesized that multimodal machine translation models ignore visual information because of the sufficiency of textual input for providing high-quality machine translation.
- To investigate this hypothesis, they degraded textual information fed to the model by:
- removing color information from the text (3% of words);
- masking out the head nouns in the source sentences (26% of words);
- progressively masking all but the first k tokens, down to the extreme case where no source words are available to the model at all and it knows only the source sentence length (see the sketch after this list).
- The experiments also included providing images unrelated to the accompanying text to see whether that would reduce the accuracy of the multimodal translation model.
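Here is a minimal sketch of the progressive masking regime mentioned above (our illustration; the placeholder token is an assumption, not the one used in the paper):

```python
def progressively_mask(tokens, k, placeholder="<blank>"):
    """Keep only the first k source tokens and replace the rest with a
    placeholder, so the model still 'knows' the sentence length but loses
    the linguistic content (our sketch of the input-degradation regime)."""
    return tokens[:k] + [placeholder] * (len(tokens) - k)

source = "a man in a red shirt is riding a bike".split()
for k in (len(source), 4, 2, 0):      # progressively remove textual context
    print(k, progressively_mask(source, k))
```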
What’s the key achievement?
- The researchers demonstrated that multimodal machine translation models do use visual information for translation when linguistic information is scarce:
- When color information is removed from text, an attentive multimodal machine translation model outperformed a neural machine translation model (NMT) by 12%.
- With entity masking, attentive MMT models show up to 4.2 METEOR improvement over NMT.
- When linguistic information is progressively unmasked, MMT models gradually become less sensitive to the visual modality, and finally, perform on par with NMT models when provided with totally unrelated images.
What does the AI community think?
- The paper received the Best Short Paper award at NAACL-HLT 2019, one of the key conferences in natural language processing.
What are future research areas?
- Generating models that can learn when and how to integrate multiple modalities by taking care of the complementary and redundant aspects of these modalities.
What are possible business applications?
- Improving the performance of machine translation systems in scenarios where the linguistic context is scarce but the relevant images are available (e.g., automated translation of signage, such as traffic and warning signs).
8. Bridging the Gap between Training and Inference for Neural Machine Translation, by Wen Zhang, Yang Feng, Fandong Meng, Di You, Qun Liu
Original Abstract
Neural Machine Translation (NMT) generates target words sequentially in the way of predicting the next word conditioned on the context words. At training time, it predicts with the ground truth words as context while at inference it has to generate the entire sequence from scratch. This discrepancy of the fed context leads to error accumulation among the way. Furthermore, word-level training requires strict matching between the generated sequence and the ground truth sequence which leads to overcorrection over different but reasonable translations. In this paper, we address these issues by sampling context words not only from the ground truth sequence but also from the predicted sequence by the model during training, where the predicted sequence is selected with a sentence-level optimum. Experiment results on Chinese->English and WMT’14 English->German translation tasks demonstrate that our approach can achieve significant improvements on multiple datasets.
Our Summary
In this paper, the researchers address the long-standing problem of exposure bias in sequence-to-sequence machine translation. Namely, at training time the ground-truth words are used as context, while at inference time the previous words generated by the model are fed as context. In such a scenario, the model predicts under conditions it has never met at training time. The suggested solution is to train the model by sampling context words not only from the ground-truth sequence but also from the decoder’s output received during model training. Moreover, the context words from the model predictions are selected not only with a word-by-word greedy search but also with a sentence-level evaluation to avoid overcorrection. The introduced approach significantly improves the performance of both the RNNSearch and Transformer models.
What’s the core idea of this paper?
- The paper addresses the problems of exposure bias and overcorrection in sequence-to-sequence translation.
- The authors propose a novel training method in which the model is fed both ground-truth words and oracle words derived from the predicted translation:
- As the model converges during training, oracle words are chosen as context more frequently than ground truth words. Thus, the training process gradually changes from a fully guided scheme towards a less guided scheme.
- To allow for multiple possible correct translations, the oracle words are selected not only with a word-by-word greedy search but also with a sentence-level evaluation.
- The proposed approach allows the model to handle mistakes made at inference and also recover from the overcorrection of alternative translations.
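The sketch below illustrates the general word-level sampling idea (our illustration; the paper's exact decay schedule and its sentence-level oracle selection with BLEU are more involved): as training progresses, the probability of feeding the ground-truth word decays, so the model increasingly conditions on its own predictions, just as it must at inference time.

```python
import math
import random

def ground_truth_probability(epoch, mu=12.0):
    """Decaying probability of feeding the ground-truth word as context.
    The inverse-sigmoid-style shape is illustrative; the paper defines
    its own schedule."""
    return mu / (mu + math.exp(epoch / mu))

def next_context_word(gold_word, oracle_word, epoch):
    """With probability p use the gold word, otherwise use the oracle word
    predicted by the model itself, so training gradually resembles inference."""
    p = ground_truth_probability(epoch)
    return gold_word if random.random() < p else oracle_word

for epoch in (0, 10, 30):             # guidance weakens as training proceeds
    print(epoch, round(ground_truth_probability(epoch), 3))
```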
What’s the key achievement?
- The proposed approach to training the neural machine translation model improves the performance of:
- the RNNSearch model by 2.36 BLEU points and the Transformer model by 1.51 BLEU points on average in Chinese to English translation tasks;
- the RNNSearch model by 1.59 BLEU points and the Transformer model by 1.31 BLEU points in English to German translation tasks.
What does the AI community think?
- The paper received the Best Long Paper Award at ACL 2019, the leading conference in natural language processing:
- “The experiments are solid, the results are convincing and likely to influence future work in MT” – from the motivation for the award.
What are possible business applications?
- Improving the performance of all systems that rely on neural machine translation, including:
- mobile translation apps;
- the automated translation of books, articles, and other media.
9. On Extractive and Abstractive Neural Document Summarization with Transformer Language Models, by Sandeep Subramanian, Raymond Li, Jonathan Pilault, Christopher Pal
Original Abstract
We present a method to produce abstractive summaries of long documents that exceed several thousand words via neural abstractive summarization. We perform a simple extractive step before generating a summary, which is then used to condition the transformer language model on relevant information before being tasked with generating a summary. We show that this extractive step significantly improves summarization results. We also show that this approach produces more abstractive summaries compared to prior work that employs a copy mechanism while still achieving higher rouge scores. Note: The abstract above was not written by the authors, it was generated by one of the models presented in this paper.
Our Summary
The paper introduces an abstractive summarizer that deploys a transformer architecture for summarizing long documents. The data is first organized so that the transformer language model can use it most effectively. Specifically, in the first step, sentence extraction is performed to get the most important sentences from the document. In the second step, the suggested approach involves training a transformer language model from scratch on the extractive summaries and on the original documents (or their introductions if the documents are too long). The experiments demonstrate that the suggested approach is surprisingly effective at summarizing long research papers and produces “more abstractive” summaries compared to prior work with copy mechanisms while achieving higher ROUGE scores.
What’s the core idea of this paper?
- The introduced model comprises two distinct trainable components, namely:
- A hierarchical document representation model to extract the most important sentences. The researchers used a hierarchical seq2seq sentence pointer and a sentence classifier to produce the extractive summary.
- A transformer language model to create an abstractive summary conditioning on the extracted sentences and the original document, or its introduction if the document is very long:
- At training, the transformer gets input data in a particular order, i.e., introduction (or the entire document), extracted sentences, abstract, the rest of the paper.
- At inference, introduction (or the entire document) and extracted sentences are provided for predicting the abstract.
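Here is a minimal sketch of how a training example might be assembled for the language model (our illustration; the separator markers are assumptions, not the authors' exact markup):

```python
def format_training_example(introduction, extracted_sentences, abstract, rest_of_paper):
    """Concatenate the document parts in the order used at training time:
    introduction, extracted sentences, abstract, then the rest of the paper.
    At inference time only the first two parts are provided and the model
    continues the sequence, generating the abstract."""
    return "\n".join([
        introduction,
        "<extract> " + " ".join(extracted_sentences),
        "<abstract> " + abstract,      # present only during training
        rest_of_paper,
    ])

def format_inference_prompt(introduction, extracted_sentences):
    """Prompt used at inference: the model generates what follows <abstract>."""
    return "\n".join([
        introduction,
        "<extract> " + " ".join(extracted_sentences),
        "<abstract>",
    ])
```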
What’s the key achievement?
- The introduced approach:
- outperforms previous extractive and abstractive summarization techniques on the arXiv, PubMed, and BigPatent datasets;
- produces more “abstractive” summaries compared to models that use a copy mechanism, and still achieves higher ROUGE scores.
- The generated abstracts are comparable to the papers’ original human-written abstracts in terms of their low proportions of n-grams copied from the full articles.
What does the AI community think?
- On Twitter, commentators from within the AI field are delighted by the self-referentiality of having the model write the abstract for its own paper.
What are future research areas?
- Exploring the possibility of training the extractive and abstractive steps in an end-to-end manner.
- Developing summarization models that will demonstrate the human-like ability to coherently and concisely synthesize summaries, while respecting the underlying facts.
What are possible business applications?
- Summarizing news articles for services that produce daily digests.
- Creating summaries of journal articles, white papers, etc., to help decision-makers absorb information more quickly.
10. CTRL: A Conditional Transformer Language Model For Controllable Generation, by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher
Original Abstract
Large-scale language models show promising text generation capabilities, but users cannot easily control particular aspects of the generated text. We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior. Control codes were derived from structure that naturally co-occurs with raw text, preserving the advantages of unsupervised learning while providing more explicit control over text generation. These codes also allow CTRL to predict which parts of the training data are most likely given a sequence. This provides a potential method for analyzing large amounts of data via model-based source attribution. We have released multiple full-sized, pretrained versions of CTRL at https://www.github.com/salesforce/ctrl.
Our Summary
Language models used for text generation are very powerful, but they are often “black boxes”, so users do not have much control over the output. To address this problem, the Salesforce research team has introduced the Conditional Transformer Language (CTRL) model that conditions on a set of control codes. With these codes, the users can control domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. Moreover, all control codes can be traced back to a specific subset of the training data, allowing CTRL to predict the subset of the training data most likely leveraged for a particular sequence. This relationship between CTRL and its training data provides new possibilities for analyzing the correlations learned from each domain.
What’s the core idea of this paper?
- Text generation tools are very powerful, but they do not give users much control over the content, style or genre of the generated text.
- The Salesforce research team has released CTRL, a 1.6 billion-parameter conditional transformer language model, that gives users more control over the generated content:
- CTRL exposes keywords called control codes which allow users to specify a domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior.
- CTRL is trained on control codes derived from the structure that naturally co-occurs with the raw text. In particular, CTRL leverages the fact that training data is usually associated with a URL that contains information relevant to the text it represents.
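Here is a minimal sketch of how a control code steers generation (our illustration; build_prompt and the sample call are hypothetical stand-ins rather than the released CTRL API, and the "Reviews"/"Rating:" format is only indicative):

```python
def build_prompt(control_code, text=""):
    """Prepend a control code (e.g. a domain such as 'Reviews' or 'Wikipedia',
    optionally a URL-derived code) so the model conditions its style, content,
    and task behavior on it."""
    return f"{control_code} {text}".strip()

def generate(model, prompt, max_tokens=50):
    """Stand-in for a decoding loop; `model` is an assumed autoregressive LM
    with a hypothetical sample() method, not the released CTRL interface."""
    return model.sample(prompt, max_tokens)

prompt = build_prompt("Reviews", "Rating: 5.0")
print(prompt)   # "Reviews Rating: 5.0" -> the model would continue in review style
```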
What’s the key achievement?
- Introducing and open-sourcing a language model that:
- enables more controllable text generation;
- provides new opportunities for analyzing large amounts of text via model-based source attribution;
- can be used to detect artificially generated text.
What does the AI community think?
- The community appreciates that the researchers offered such a clear discussion on the ethical considerations behind releases of large language models and that they included it in a separate section of the paper rather than relegating it to a blog post.
What are future research areas?
- Introducing a greater variety of control codes to allow finer-grained control.
- Extending to other areas of NLP including abstractive summarization and commonsense reasoning.
- Analyzing the relationships between training data and language models.
- Exploring the possibilities to make the interface between humans and language models more explicit and intuitive.
What are possible business applications?
- Improved and tailored text generation for question-answering systems and other human-computer interaction applications.
- Identifying artificially generated text, to detect malicious uses such as automatically generated essays or fake reviews.
Where can you get implementation code?
- The authors have released multiple full-sized, pretrained versions of CTRL on GitHub.
11. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Original Abstract
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.
Our Summary
The Google Research team addresses the problem of the continuously growing size of the pretrained language models, which results in memory limitations, longer training time, and sometimes unexpectedly degraded performance. Specifically, they introduce A Lite BERT (ALBERT) architecture that incorporates two parameter-reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. In addition, the suggested approach includes a self-supervised loss for sentence-order prediction to improve inter-sentence coherence. The experiments demonstrate that the best version of ALBERT sets new state-of-the-art results on GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large.
What’s the core idea of this paper?
- It is not practical to keep improving language models simply by making them larger, because of the memory limitations of available hardware, longer training times, and unexpected degradation of model performance as the number of parameters grows.
- To address this problem, the researchers introduce the ALBERT architecture that incorporates two parameter-reduction techniques:
- factorized embedding parameterization, where the size of the hidden layers is separated from the size of the vocabulary embeddings by decomposing the large vocabulary-embedding matrix into two small matrices (see the sketch after this list);
- cross-layer parameter sharing to prevent the number of parameters from growing with the depth of the network.
- The performance of ALBERT is further improved by introducing the self-supervised loss for sentence-order prediction to address BERT’s limitations with regard to inter-sentence coherence.
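Here is a minimal sketch of the factorized embedding parameterization (our illustration with arbitrary sizes): instead of a single V × H embedding matrix, ALBERT uses a V × E lookup followed by an E × H projection, which sharply reduces parameters when E is much smaller than H.

```python
import torch.nn as nn

V, H, E = 30000, 4096, 128   # vocab size, hidden size, embedding size (illustrative)

# BERT-style embedding: a single V x H matrix.
bert_style = nn.Embedding(V, H)

# ALBERT-style factorization: a V x E lookup followed by an E x H projection.
albert_style = nn.Sequential(nn.Embedding(V, E), nn.Linear(E, H, bias=False))

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

print(count_parameters(bert_style))    # 122,880,000 parameters
print(count_parameters(albert_style))  # 4,364,288 parameters (~28x fewer)
```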
What’s the key achievement?
- With the introduced parameter-reduction techniques, the ALBERT configuration with 18× fewer parameters and 1.7× faster training compared to the original BERT-large model achieves only slightly worse performance.
- The much larger ALBERT configuration, which still has fewer parameters than BERT-large, outperforms all of the current state-of-the-art language models, achieving:
- 89.4% accuracy on the RACE benchmark;
- 89.4 score on the GLUE benchmark; and
- an F1 score of 92.2 on the SQuAD 2.0 benchmark.
What does the AI community think?
- The paper has been submitted to ICLR 2020 and is available on the OpenReview forum, where you can see the reviews and comments of NLP experts. The reviewers are mainly very appreciative of the presented paper.
What are future research areas?
- Speeding up training and inference through methods like sparse attention and block attention.
- Further improving the model performance through hard example mining, more efficient model training, and other approaches.
What are possible business applications?
- The ALBERT language model can be leveraged in the business setting to improve performance on a wide range of downstream tasks, including chatbot performance, sentiment analysis, document mining, and text classification.
Where can you get implementation code?
- The original implementation of ALBERT is available on GitHub.
- A TensorFlow implementation of ALBERT is also available here.
- A PyTorch implementation of ALBERT can be found here and here.
If you like these research summaries, you might be also interested in the following articles:
- Top AI & Machine Learning Research Papers From 2019
- 10 Important Research Papers In Conversational AI From 2019
- 10 Cutting-Edge Research Papers In Computer Vision From 2019
- Top 12 AI Ethics Research Papers Introduced In 2019
- Breakthrough Research In Reinforcement Learning From 2019
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.