14 NLP Research Breakthroughs You Can Apply To Your Business

Top NLP Research Papers of 2018 Summarized

By Mariya Yao, TOPBOTS

UPDATE: We’ve also summarized the top 2019 and top 2020 NLP research papers.

Language understanding is a challenge for computers. Subtle nuances of communication that human toddlers can understand still confuse the most powerful machines. Even though advanced techniques like deep learning can detect and replicate complex language patterns, machine learning models still lack fundamental conceptual understanding of what our words really mean.

That said, 2018 did yield a number of landmark research breakthroughs which pushed the fields of natural language processing, understanding, and generation forward.

We summarized 14 research papers covering several advances in natural language processing (NLP), including high-performing transfer learning techniques, more sophisticated language models, and newer approaches to content understanding. There are hundreds more papers in NLP, NLU, and NLG that we have not covered in this summary, but we hope this article gives you a solid foundational understanding of the key papers of 2018.

Due to the importance and prevalence of NLP for applied and enterprise AI, we featured some of the papers below in our previous article summarizing the top overall machine learning papers of 2018. Since you might not have read that previous piece, we chose to highlight the NLP ones again here.

We’ve done our best to summarize these papers correctly, but if we’ve made any mistakes, please contact us to request a fix.

If these summaries of scientific AI research papers are useful for you, you can subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries. We’re planning to release summaries of important papers in computer vision, reinforcement learning, and conversational AI in the next few weeks.

If you’d like to skip around, here are the papers we featured:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Sequence Classification with Human Attention
Phrase-Based & Neural Unsupervised Machine Translation
What you can cram into a single vector: Probing sentence embeddings for linguistic properties
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Deep contextualized word representations
Meta-Learning for Low-Resource Neural Machine Translation
Linguistically-Informed Self-Attention for Semantic Role Labeling
A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks
Know What You Don’t Know: Unanswerable Questions for SQuAD
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Universal Language Model Fine-tuning for Text Classification
Improving Language Understanding by Generative Pre-Training
Dissecting Contextual Word Embeddings: Architecture and Representation

Important Natural Language Processing (NLP) Research Papers of 2018

1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova

Original Abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

Our Summary

A Google AI team presents a new cutting-edge model for Natural Language Processing (NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design allows the model to consider the context from both the left and the right sides of each word. While being conceptually simple, BERT obtains new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition and other tasks related to general language understanding.

What’s the core idea of this paper?

Training a deep bidirectional model by randomly masking a percentage of input tokens and predicting them from context, thus avoiding cycles where words can indirectly “see themselves”.
Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERT to better understand relationships between sentences.
Training a very big model (24 Transformer blocks, 1024-hidden, 340M parameters) with lots of data (3.3 billion word corpus).
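
To make the two pre-training tasks more concrete, here is a minimal sketch of how training examples could be built. It is our illustration, not the official BERT code: the function names, the 15% default masking rate, and the 50/50 split between real and random next sentences are assumptions for this example (the rates follow the paper, not the summary above), and the paper's additional rule of sometimes keeping or randomly replacing a selected token is omitted for brevity.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] holds the original token at masked positions and None elsewhere,
    so a model is only asked to predict the hidden tokens.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels

def make_nsp_example(sent_a, true_next, random_sent, rng):
    """Build one sentence-pair example: label 1 if B really follows A, else 0."""
    if rng.random() < 0.5:
        return sent_a, true_next, 1
    return sent_a, random_sent, 0

tokens = "the man went to the store to buy milk".split()
print(mask_tokens(tokens, mask_prob=0.3, seed=1))
```

Because the model sees the unmasked words on both sides of each hidden token, it can condition on left and right context at the same time.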

What’s the key achievement?

Advancing the state-of-the-art for 11 NLP tasks, including:
- getting a GLUE score of 80.4%, an absolute improvement of 7.6 percentage points over the previous best result;
- achieving a Test F1 score of 93.2 on SQuAD v1.1 and outperforming human performance by 2.0 points.
Suggesting a pre-trained model, which doesn’t require any substantial architecture modifications to be applied to specific NLP tasks.

What does the AI community think?

The BERT model marks a new era of NLP.
In a nutshell, two unsupervised tasks together (“fill in the blank” and “does sentence B come after sentence A?”) provide great results for many NLP tasks.
Pre-training of language models becomes a new standard.

What are future research areas?

Testing the method on a wider range of tasks.
Investigating the linguistic phenomena that may or may not be captured by BERT.

What are possible business applications?

BERT may assist businesses with a wide range of NLP problems, including:
- chatbots for better customer experience;
- analysis of customer reviews;
- the search for relevant information, etc.

Where can you get implementation code?

Google Research has released an official GitHub repository with TensorFlow code and pre-trained models for BERT.
A PyTorch implementation of BERT is also available on GitHub.

2. Sequence Classification with Human Attention, by Maria Barrett, Joachim Bingel, Nora Hollenstein, Marek Rei, Anders Søgaard

Original Abstract

Learning attention functions requires large volumes of data, but many NLP tasks simulate human behavior, and in this paper, we show that human attention really does provide a good inductive bias on many attention functions in NLP. Specifically, we use estimated human attention derived from eye-tracking corpora to regularize attention functions in recurrent neural networks. We show substantial improvements across a range of tasks, including sentiment analysis, grammatical error detection, and detection of abusive language.

Our Summary

Maria Barrett and her colleagues suggest using human attention derived from eye-tracking corpora to regularize attention in recurrent neural networks (RNN). By leveraging publicly available eye-tracking corpora, i.e., texts augmented with eye-tracking measures such as fixation duration times, they were able to substantially improve the RNN’s accuracy across three NLP tasks, including sentiment analysis, detection of abusive language, and grammatical error detection.

What’s the core idea of this paper?

Using human attention, as estimated from eye-tracking corpora, to regularize machine attention.
The input to the model is a set of labeled sequences (sentences paired with discrete category labels) and a set of sequences, in which each token is associated with a scalar value representing the attention human readers devoted to this token on average.
The RNN jointly learns the recurrent parameters and the attention function but can alternate between supervision signals from labeled sequences and from attention trajectories in eye-tracking corpora.
The suggested approach does not require the target task data to come with eye-tracking information.
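
The idea of regularizing machine attention with human attention can be sketched as a second loss term. The snippet below is a simplified illustration under our own assumptions: the squared-error form of the loss, the normalization of human attention values, and the fixed alternation schedule in `pick_objective` are choices made for this example, not details confirmed by the summary above.

```python
import math

def softmax(scores):
    """Turn raw per-token scores into an attention distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_loss(token_scores, human_attention):
    """Mean squared error between model attention and normalized human attention."""
    model_att = softmax(token_scores)
    total = sum(human_attention)
    target = [h / total for h in human_attention]
    return sum((a - t) ** 2 for a, t in zip(model_att, target)) / len(target)

def pick_objective(step, attention_every=2):
    """Alternate supervision: eye-tracking batch on some steps, task batch on others."""
    return "attention" if step % attention_every == 0 else "task"

# Model scores for three tokens vs. hypothetical average fixation durations (ms).
print(attention_loss([0.2, 1.5, 0.1], [120.0, 310.0, 90.0]))
```

Since the two signals come from different batches, the labeled task data never needs eye-tracking annotations of its own.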

What’s the key achievement?

Introducing a recurrent neural architecture with attention for sequence classification tasks.
Demonstrating that using human attention, as estimated from eye-tracking corpora, to regularize attention functions leads to significant improvements across a range of tasks, including:
- sentiment analysis,
- detection of abusive language, and
- grammatical error detection.
Getting a mean error reduction of 4.5% over the baseline. The improvements are primarily due to increased recall.

What does the AI community think?

The paper received a special award for the best paper on research inspired by human language learning and processing at CoNLL 2018, a top-tier conference on Computational Natural Language Learning.
“This work is really cool, and is one of very few that directly use signals from humans doing instinctive tasks.” – Patrick Lewis, FAIR Ph.D. student.

What are future research areas?

Exploring other possibilities to leverage human attention as an inductive bias on machine attention when learning human-related tasks.

What are possible business applications?

RNNs complemented with human attention signals can be applied in business settings to:
- enhance automatic analysis of customer reviews;
- filter out abusive comments, reviews, and remarks.

Where can you get implementation code?

Code for this research paper is available on GitHub.

3. Phrase-Based & Neural Unsupervised Machine Translation, by Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, Marc’Aurelio Ranzato

Original Abstract

Machine translation systems achieve near human-level performance on some languages, yet their effectiveness strongly relies on the availability of large amounts of parallel sentences, which hinders their applicability to the majority of language pairs. This work investigates how to learn to translate when having access to only large monolingual corpora in each language. We propose two model variants, a neural and a phrase-based model. Both versions leverage a careful initialization of the parameters, the denoising effect of language models and automatic generation of parallel data by iterative back-translation. These models are significantly better than methods from the literature, while being simpler and having fewer hyper-parameters. On the widely used WMT’14 English-French and WMT’16 German-English benchmarks, our models respectively obtain 28.1 and 25.2 BLEU points without using a single parallel sentence, outperforming the state of the art by more than 11 BLEU points. On low-resource languages like English-Urdu and English-Romanian, our methods achieve even better results than semi-supervised and supervised approaches leveraging the paucity of available bitexts. Our code for NMT and PBSMT is publicly available.

Our Summary

Facebook AI researchers acknowledge the lack of large parallel corpora for training machine translation systems and suggest a better way to leverage monolingual data for machine translation (MT). In particular, they argue that unsupervised MT can be successfully accomplished with suitable initialization of the translation model, language modeling and iterative back-translation. The researchers suggest two model variants, a neural one and a phrase-based one, and both dramatically outperform the previous state of the art.

What’s the core idea of this paper?

Unsupervised MT can be accomplished with:
- suitable initialization of the translation models (e.g., with byte-pair encodings);
- training language models in both source and target languages for improving the quality of translation models (e.g., performing local substitutions, word reordering);
- iterative back-translation for automatic generation of parallel data.
There are two model variants, neural and phrase-based:
- Neural machine translation has an additional important property – sharing of internal representations across languages.
- Phrase-based machine translation outperforms neural models on low-resource language pairs, is easy to interpret and fast to train.
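
To show how iterative back-translation passes information between the two translation directions, here is a toy sketch. Everything in it is a stand-in of our own: the "models" are word-by-word lookup tables, `update_table` is a placeholder for real training, and the seed lexicon plays the role of the initialization step. It illustrates the data flow only, not the paper's neural or phrase-based systems.

```python
def translate(sentence, table):
    """Word-by-word translation with a lookup table; unknown words are copied."""
    return [table.get(word, word) for word in sentence]

def back_translation_round(mono_src, mono_tgt, src2tgt, tgt2src):
    """Each direction produces synthetic parallel data for the opposite direction."""
    # Real target sentences translated back to the source side train src->tgt.
    pairs_for_s2t = [(translate(t, tgt2src), t) for t in mono_tgt]
    # Real source sentences translated to the target side train tgt->src.
    pairs_for_t2s = [(translate(s, src2tgt), s) for s in mono_src]
    return pairs_for_s2t, pairs_for_t2s

def update_table(table, pairs):
    """Toy 'training': learn word mappings from position-aligned synthetic pairs."""
    new_table = dict(table)
    for synthetic, real in pairs:
        for s_word, r_word in zip(synthetic, real):
            if s_word != r_word:  # skip words that were merely copied
                new_table[s_word] = r_word
    return new_table

en2fr = {}                                  # starts with no knowledge
fr2en = {"chat": "cat", "noir": "black"}    # hypothetical seed lexicon
mono_en = [["black", "cat"]]
mono_fr = [["chat", "noir"]]
for _ in range(2):
    s2t_pairs, t2s_pairs = back_translation_round(mono_en, mono_fr, en2fr, fr2en)
    en2fr = update_table(en2fr, s2t_pairs)
    fr2en = update_table(fr2en, t2s_pairs)
print(translate(["black", "cat"], en2fr))
```

In this toy run the English-to-French table is learned entirely from synthetic pairs, without a single human-translated sentence.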

What’s the key achievement?

Neural and phrase-based machine translation models remarkably outperform previous unsupervised baselines, for example:
- for the English-French task, the phrase-based translation model obtains a BLEU score of 28.1 (+11 BLEU points over the previous best result);
- for the German-English task, the neural and phrase-based translation models combined get a BLEU score of 25.2 (+10 BLEU points over the baseline).
The unsupervised phrase-based translation model achieves the same performance as its supervised counterpart trained on more than 100,000 parallel sentences.

What does the AI community think?

The paper was awarded the Best Paper Award at EMNLP 2018, a leading conference in the area of natural language processing.
“We could go now on a planet where people speak a language that nobody else speaks — and you can actually go and try to have a decent translation of what is said there,” Antoine Bordes, FAIR Paris Lab director, said to VentureBeat.

What are future research areas?

Searching for more effective instantiations of the suggested principles or other principles altogether.
Extending to semi-supervised settings.

What are possible business applications?

Improving the results of machine translation for language pairs where there are not enough parallel corpora to train supervised machine translation systems.

Where can you get implementation code?

The Facebook team provides the original implementation for this research paper on GitHub.

4. What you can cram into a single vector: Probing sentence embeddings for linguistic properties, by Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, Marco Baroni

Original Abstract

Although much effort has recently been devoted to training high-quality sentence embeddings, we still have a poor understanding of what they are capturing. “Downstream” tasks, often based on sentence classification, are commonly used to evaluate the quality of sentence representations. The complexity of the tasks makes it however difficult to infer what kind of information is present in the representations. We introduce here 10 probing tasks designed to capture simple linguistic features of sentences, and we use them to study embeddings generated by three different encoders trained in eight distinct ways, uncovering intriguing properties of both encoders and training methods.

Our Summary

A Facebook AI research team seeks to better understand what is captured by sentence embeddings. The complexity of downstream tasks doesn’t allow us to get such an understanding directly. Thus, the paper introduces 10 probing tasks designed to capture simple linguistic features of sentences. The results obtained through these probing tasks reveal some interesting properties of both encoders and training methods.

What’s the core idea of this paper?

We have a number of modern sentence embeddings methods that show very good performance, but we still lack an understanding of what they are capturing.
The researchers address this problem by introducing 10 probing tasks to study embeddings generated by 3 different encoders (BiLSTM-last, BiLSTM-max, and Gated ConvNet) trained in 8 distinct ways.
The probing tasks test the extent to which sentence embeddings are preserving:
- surface information (number of words in the sentence, word content);
- syntactic information (word order, the hierarchical structure of the sentence, the sequence of top constituents);
- semantic information (tense of the main-clause verb, number of subjects and objects, randomly replaced words, the order of clauses).
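
A probing task boils down to training a simple classifier on frozen sentence embeddings and checking how well it recovers a linguistic property. The sketch below is a hypothetical miniature of that setup: `embed` is a stand-in bag-of-vectors encoder with pseudo-random word vectors, `length_bin` is a simplified sentence-length label with an arbitrary threshold, and we swap in a nearest-centroid classifier for brevity rather than the classifier used in the paper.

```python
import random

def embed(sentence, dim=8):
    """Stand-in frozen encoder: sum of fixed pseudo-random word vectors."""
    vec = [0.0] * dim
    for word in sentence:
        rng = random.Random(word)  # the same word always gets the same vector
        for i in range(dim):
            vec[i] += rng.uniform(-1, 1)
    return vec

def length_bin(sentence):
    """Probing label for a surface task: 0 = short, 1 = long."""
    return 0 if len(sentence) <= 5 else 1

def fit_centroids(vectors, labels):
    """Average the embeddings of each class."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Pick the class whose centroid is closest to the embedding."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], vec)))

def probe_accuracy(encoder, train, test, label_fn):
    """Higher accuracy suggests the property is easier to recover from the embedding."""
    centroids = fit_centroids([encoder(s) for s in train], [label_fn(s) for s in train])
    hits = sum(predict(centroids, encoder(s)) == label_fn(s) for s in test)
    return hits / len(test)
```

The encoder is never updated during probing, so any success of the classifier reflects information that was already present in the embeddings.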

What’s the key achievement?

Performing an extensive linguistic evaluation of modern sentence encoders.
Uncovering some intriguing properties of both encoders and training methods:
- Bag-of-Vectors is surprisingly good at capturing sentence-level properties due to redundancies in natural linguistic input.
- Different encoder architectures trained with the same objective with similar performance can result in different embeddings.
- The overall probing task performance of convolutional architecture is comparable to that of the best LSTM architecture.
- BiLSTM-max outperforms BiLSTM-last both in the downstream tasks and in the probing tasks. Furthermore, it achieves very good performance even without any training.

What does the AI community think?

The paper was presented at the Annual Meeting of the Association for Computational Linguistics (ACL 2018).
“It was very refreshing to see that rather than introducing ever shinier new models, many papers methodically investigated existing models and what they capture.” – Sebastian Ruder, Research Scientist at AYLIEN.

What are future research areas?

Extending probing tasks to other languages and linguistic domains.
Investigating how multi-task training affects probing task performance.
Finding more linguistically-aware universal encoders by leveraging the introduced probing tasks.

What are possible business applications?

A better understanding of the information captured by different pre-trained encoders will help researchers build more linguistically-aware encoders. This, in turn, will improve the NLP systems applied in business settings.

Where can you get implementation code?

Probing tasks described in this research paper are available on GitHub.

5. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference, by Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi

Original Abstract

Given a partial description like “she opened the hood of the car,” humans can reason about the situation and anticipate what might come next (“then, she examined the engine”). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning.

We present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.

Our Summary

When you read “He pours the raw egg batter into the pan. He…”, you are likely to choose the correct ending “...lifts the pan and moves it around to shuffle the eggs.” However, the answer is not obvious: it requires commonsense reasoning. SWAG, or Situations With Adversarial Generations, is a large-scale dataset created to support research toward Natural Language Inference (NLI) with commonsense reasoning. It was created using a novel approach – Adversarial Filtering – that can be applied to build future large-scale datasets in a cost-effective manner.

What’s the core idea of this paper?

SWAG contains 113K multiple-choice questions, collected using video captions:
- The context sentence comes from a video caption.
- The correct answer is the actual next caption in the video.
- Wrong answers are generated using Adversarial Filtering (AF).

The idea behind Adversarial Filtering:
- Massively overgenerating endings (wrong answers) and then selecting the subset that stylistically looks like a real ending.
- The filtering model determines which endings seem to be machine-generated. These endings are removed and replaced with new endings that the model thinks are human-written.

Finally, the entire dataset is validated by crowd workers.
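
The Adversarial Filtering loop can be sketched roughly as follows. This is a simplified illustration under our own assumptions: `train_scorer` is a hypothetical callable that returns a function estimating how machine-written an ending looks, and the threshold and number of rounds are arbitrary. The actual procedure trains an ensemble of stylistic classifiers, which we do not reproduce here.

```python
def adversarial_filter(selected, pool, train_scorer, rounds=3, threshold=0.5):
    """Swap out endings the filtering model flags as machine-written.

    selected: current wrong endings for one context
    pool: oversampled candidate endings to draw replacements from
    train_scorer: hypothetical; takes the current endings and returns
                  a function ending -> estimated P(machine-written)
    """
    selected, pool = list(selected), list(pool)  # leave the inputs untouched
    for _ in range(rounds):
        score = train_scorer(selected)  # the filter is refreshed every round
        for i, ending in enumerate(selected):
            if score(ending) <= threshold:
                continue  # already looks human-written to the model
            candidates = [c for c in pool if score(c) <= threshold]
            if candidates:
                pool.remove(candidates[0])
                selected[i] = candidates[0]
    return selected

# Toy scorer: treats a repeated "the the" as a giveaway of machine text.
def toy_scorer(_endings):
    return lambda ending: 1.0 if "the the" in ending else 0.0

print(adversarial_filter(["he the the runs", "she smiles"],
                         ["it the the", "he walks away", "they leave"],
                         toy_scorer))
```

Each round makes the remaining wrong answers harder to tell apart from real ones on style alone, which is what raises the difficulty of the dataset.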

What’s the key achievement?

Presenting a new challenging large-scale dataset for testing NLI systems.
Introducing Adversarial Filtering, a method that can be used for cost-effective construction of large-scale datasets with several benefits:
- The diversity of the sentences is not limited by the creativity of human writers.
- The dataset creator can arbitrarily raise the bar of difficulty during dataset construction.
- Humans don’t write the endings but only validate them, which is much cheaper.

What does the AI community think?

The paper was presented at EMNLP 2018, a leading conference in the area of natural language processing.
However, even before it was presented at this important NLP conference, the dataset was solved by Google’s new BERT model, which achieved an accuracy of 86.2% and got very close to human accuracy (88%).

What are future research areas?

Creating a more adversarial version of SWAG using better Adversarial Filtering and language models.

What are possible business applications?

The dataset can be helpful in building NLI systems with commonsense reasoning, thus improving the development of Q&A systems and conversational AI.

Where can you get implementation code?

The SWAG dataset is available on GitHub.

6. Deep contextualized word representations, by Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer

Original Abstract

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

Our Summary

The team from the Allen Institute for Artificial Intelligence introduces a new type of deep contextualized word representation – Embeddings from Language Models (ELMo). In ELMo-enhanced models, each word is vectorized on the basis of the entire context in which it is used. Adding ELMo to existing NLP systems results in 1) relative error reduction ranging from 6-20%, 2) a significantly lower number of epochs required to train the models and 3) a significantly reduced amount of training data needed to reach baseline performance.

What’s the core idea of this paper?

To generate word embeddings as a weighted sum of the internal states of a deep bi-directional language model (biLM), pre-trained on a large text corpus.
To include representations from all layers of a biLM as different layers represent different types of information.
To base ELMo representations on characters so that the network can use morphological clues to “understand” out-of-vocabulary tokens unseen in training.
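
The weighted sum over biLM layers can be written in a few lines. The sketch below is our illustration for a single token: the function name and the toy layer vectors are made up for the example, while the form of the combination (softmax-normalized layer weights multiplied by a scalar `gamma`) follows the paper's description of task-specific ELMo vectors.

```python
import math

def elmo_vector(layer_states, layer_logits, gamma=1.0):
    """Combine one token's per-layer vectors: gamma * sum_j softmax(logits)_j * h_j."""
    m = max(layer_logits)
    exps = [math.exp(w - m) for w in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_states[0])
    return [gamma * sum(w * h[d] for w, h in zip(weights, layer_states))
            for d in range(dim)]

# Three layers (e.g., character-based input layer plus two LSTM layers), 2-d toy vectors.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_vector(layers, [0.0, 0.0, 0.0]))   # equal weight on every layer
print(elmo_vector(layers, [50.0, 0.0, 0.0]))  # almost all weight on the first layer
```

In a downstream model the layer logits and `gamma` would be learned together with the task, so each task can decide which layers of the biLM matter most.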

What’s the key achievement?

Adding ELMo to the model leads to new state-of-the-art results, with relative error reductions ranging from 6 – 20% across such NLP tasks as question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis.
Enhancing the model with ELMo results in a significantly lower number of updates required to reach state-of-the-art performance. For example, the Semantic Role Labeling (SRL) model with ELMo needs only 10 epochs to exceed the baseline maximum reached after 486 epochs of training.
Introducing ELMo to the model also significantly reduces the amount of training data needed to achieve the same level of performance. For example, for the SRL task, the ELMo-enhanced model needs only 1% of the training set to achieve the same performance as the baseline model with 10% of the training data.

What does the AI community think?

The paper was recognized as an Outstanding Paper at NAACL, one of the most influential NLP conferences in the world.
The ELMo method introduced in the paper is considered one of the greatest breakthroughs of 2018 and a staple in NLP for years to come.

What are future research areas?

Incorporating this method into specific tasks by concatenating ELMos with context-independent word embeddings.
Experimenting with concatenating ELMos with the output as well.

What are possible business applications?

ELMo significantly improves the performance of existing NLP systems and thus enhances:
- the performance of chatbots, which will be better at understanding humans and answering questions;
- the classification of positive and negative customer reviews;
- the retrieval of relevant information and documents, etc.

Where can you get implementation code?

The Allen Institute provides pre-trained ELMo models in English and Portuguese. You can also retrain models using TensorFlow code.

7. Meta-Learning for Low-Resource Neural Machine Translation, by Jiatao Gu, Yong Wang, Yun Chen, Kyunghyun Cho, Victor O.K. Li

Original Abstract

In this paper, we propose to extend the recently introduced model-agnostic meta-learning algorithm (MAML) for low-resource neural machine translation (NMT). We frame low-resource translation as a meta-learning problem, and we learn to adapt to low-resource languages based on multilingual high-resource language tasks. We use the universal lexical representation to overcome the input-output mismatch across different languages. We evaluate the proposed meta-learning strategy using eighteen European languages (Bg, Cs, Da, De, El, Es, Et, Fr, Hu, It, Lt, Nl, Pl, Pt, Sk, Sl, Sv and Ru) as source tasks and five diverse languages (Ro, Lv, Fi, Tr and Ko) as target tasks. We show that the proposed approach significantly outperforms the multilingual, transfer learning based approach and enables us to train a competitive NMT system with only a fraction of training examples. For instance, the proposed approach can achieve as high as 22.04 BLEU on Romanian-English WMT’16 by seeing only 16,000 translated words (~600 parallel sentences).

Our Summary

Researchers from the University of Hong Kong and New York University use a model-agnostic meta-learning algorithm (MAML) to solve the problem of low-resource machine translation. In particular, they suggest using many high-resource language pairs to find the initial parameters of the model. This initialization then makes it possible to train a new translation model on a low-resource language pair using only a few steps of learning. For example, the model initialized using 18 high-resource language pairs was able to achieve a BLEU score of 22.04 on a new language pair after seeing only around 600 parallel sentences.

What’s the core idea of this paper?

The paper introduces a new meta-learning method, MetaNMT, which uses many high-resource language pairs to find good initial parameters and then trains a new translation model on a low-resource language pair starting from those parameters.
Meta-learning can be applied to low-resource machine translation only if the input and output spaces are shared across all the source and target tasks. However, this is generally not the case since different languages have different vocabularies. To tackle this issue, the researchers dynamically build a vocabulary specific to each language using a key-value memory network.
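
The "learn a good starting point, then fine-tune in a few steps" idea can be illustrated on a toy problem. The sketch below is not the MetaNMT system: each "task" is just a one-parameter quadratic loss standing in for a language pair, the learning rates are arbitrary, and we use the first-order simplification of MAML (the meta-gradient ignores how the adapted parameter depends on the initial one).

```python
def grad(theta, target):
    """Gradient of the toy task loss (theta - target)^2."""
    return 2.0 * (theta - target)

def meta_train(task_targets, theta=0.0, inner_lr=0.1, meta_lr=0.05, steps=200):
    """Find an initialization from which one gradient step helps on every task."""
    for _ in range(steps):
        meta_grad = 0.0
        for target in task_targets:
            # Simulate fine-tuning on this task with one inner gradient step.
            adapted = theta - inner_lr * grad(theta, target)
            # First-order approximation: evaluate the gradient at the adapted point.
            meta_grad += grad(adapted, target)
        theta -= meta_lr * meta_grad / len(task_targets)
    return theta

def fine_tune(theta, target, lr=0.1, steps=3):
    """Adapt to a new task with only a few gradient steps."""
    for _ in range(steps):
        theta -= lr * grad(theta, target)
    return theta

init = meta_train([1.0, 2.0, 3.0])   # "high-resource" tasks
new_task = 2.5                        # unseen "low-resource" task
print(abs(fine_tune(init, new_task) - new_task))  # error from meta-learned start
print(abs(fine_tune(0.0, new_task) - new_task))   # error from a naive start
```

With the same three fine-tuning steps, the meta-learned initialization ends up much closer to the new task's optimum than the naive starting point.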

What’s the key achievement?

Suggesting a new approach to Neural Machine Translation for extremely low resource languages, which:
- enables sharing the information between high-resource and extremely low-resource language pairs;
- uses only a few thousand sentences for fine-tuning a new translation model on a low-resource language pair;
- fine-tunes for a new language in a couple of minutes.
The experiments demonstrate that:
- Meta-learning consistently performs better than multilingual transfer learning.
- The choice of language pair for a validation set for meta-learning impacts the performance of the resulting model. For example, Finnish-English benefits more when Romanian-English is used for validation, while Turkish-English prefers validation on Latvian-English.

What does the AI community think?

The paper was presented at EMNLP 2018, a leading conference in the area of natural language processing.
The presented approach received a Facebook award for low-resource neural machine translation.

What are future research areas?

Meta-learning for semi-supervised Neural Machine Translation, or learning to learn from monolingual corpora.
Multi-modal meta-learning, in which multiple meta-models are learned and a new language can freely choose a model to adapt from.

What are possible business applications?

MetaNMT can be used to improve the results of machine translation for language pairs where the available parallel corpora are extremely small.

8. Linguistically-Informed Self-Attention for Semantic Role Labeling, by Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, Andrew McCallum

Original Abstract

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network with no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and SRL. Unlike previous models which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to syntactic parents for each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute higher than the previous state-of-the-art on newswire and more than 3.5 F1 on out-of-domain data, nearly 10% reduction in error. On ConLL-2012 English SRL we also show an improvement of more than 2.5 F1. LISA also out-performs the state-of-the-art with contextually-encoded (ELMo) word representations, by nearly 1.0 F1 on news and more than 2.0 F1 on out-of-domain text.

Our Summary

Researchers from UMass Amherst College of Information and Computer Sciences and Google AI Language introduce Linguistically-Informed Self-Attention (LISA), a neural network model that combines deep learning and linguistic formalism, and thus more effectively utilizes syntactic parses to obtain semantic meaning. The experiments demonstrate that LISA not only achieves state-of-the-art performance on newswire, the writing style it has been taught to analyze, but also generalizes well to writing styles across different domains, such as journalism and fiction writing.

What’s the core idea of this paper?

The linguistically-informed self-attention (LISA) model is based on the Transformer encoder.
The input to the network can be a sequence of standard pre-trained GloVe word embeddings but better performance is achieved with the pre-trained ELMo representations combined with task-specific learned parameters.
To pass linguistic knowledge to later layers, the researchers suggest training the self-attention mechanism to attend to specific tokens corresponding to the syntactic structure of the sentence. Moreover, injection of auxiliary parse information can be performed at test time without re-training of the model.
Following a multi-task learning approach, the parameters of lower layers in the semantic role labeling (SRL) model are shared to predict part of speech and predicates.
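
The syntactically informed attention head described above can be illustrated with a minimal sketch in plain Python. It assumes raw attention scores are already available for one head, treats the root token as its own parent, and uses hypothetical function names (`parent_attention_loss`, `inject_parse`, `attend`); it is not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def parent_attention_loss(score_rows, parent_index):
    """Mean cross-entropy that pushes token i's attention toward its syntactic parent.

    score_rows[i][j] is the raw attention score from token i to token j.
    parent_index[i] is the position of token i's parent (assumption: the root
    points to itself).
    """
    loss = 0.0
    for row, parent in zip(score_rows, parent_index):
        loss -= math.log(softmax(row)[parent])
    return loss / len(score_rows)

def inject_parse(parent_index):
    """Test-time injection: build one-hot attention weights from an external parse."""
    n = len(parent_index)
    return [[1.0 if j == parent else 0.0 for j in range(n)] for parent in parent_index]

def attend(weights, values):
    """Weighted sum of value vectors; row i summarizes what token i attends to."""
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]

# "She ate fish": both "She" and "fish" depend on "ate", which is the root.
parents = [1, 1, 1]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(attend(inject_parse(parents), values))  # every token reads the vector of "ate"
```

During training the loss term would be added to the other task losses; at test time the one-hot weights from `inject_parse` stand in for the learned weights of that single head, which is why no re-training is needed.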

What’s the key achievement?

Developing a new technique for integrating syntax into a neural network model.
Achieving new state-of-the-art performance in semantic role labeling:
- with GloVe embeddings: +2.5 F1 points on news and +3.5 F1 points on out-of-domain text;
- with ELMo embeddings: +1.0 F1 point on news and +2.0 F1 points on out-of-domain text.

What does the AI community think?

The paper was awarded the Best Long Paper Award at EMNLP 2018, a leading conference in the area of natural language processing.
“This paper has a lot to like: a Transformer trained jointly on both syntactic and semantic tasks; the ability to inject high-quality parses at test time; and out-of-domain evaluation.” – Sebastian Ruder, Research Scientist at AYLIEN.

What are future research areas?

Improving the model’s parsing accuracy.
Developing better training techniques.
Adapting to more tasks.

What are possible business applications?

Semantic Role Labeling is important for many downstream NLP tasks, including:
- information extraction;
- question answering;
- automatic summarization;
- machine translation.

Where can you get implementation code?

The implementation of this research paper is available from its authors in TensorFlow.

9. A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks, by Victor Sanh, Thomas Wolf, and Sebastian Ruder

Original Abstract

Much effort has been devoted to evaluate whether multi-task learning can be leveraged to learn rich representations that can be used in various Natural Language Processing (NLP) downstream applications. However, there is still a lack of understanding of the settings in which multi-task learning has a significant effect. In this work, we introduce a hierarchical model trained in a multi-task learning setup on a set of carefully selected semantic tasks. The model is trained in a hierarchical fashion to introduce an inductive bias by supervising a set of low level tasks at the bottom layers of the model and more complex tasks at the top layers of the model. This model achieves state-of-the-art results on a number of tasks, namely Named Entity Recognition, Entity Mention Detection and Relation Extraction without hand-engineered features or external NLP tools like syntactic parsers. The hierarchical training supervision induces a set of shared semantic representations at lower layers of the model. We show that as we move from the bottom to the top layers of the model, the hidden states of the layers tend to represent more complex semantic information.

Our Summary

The researchers introduce a multi-task learning approach for a set of interrelated NLP tasks: Named Entity Recognition, Entity Mention Detection, Coreference Resolution and Relation Extraction. They show that a single model trained in a hierarchical fashion to solve all four tasks together beats the state of the art on 3 out of 4 tasks. In addition, the multi-task learning framework speeds up training remarkably compared to single-task models.

What’s the core idea of this paper?

A multi-task learning approach can be used effectively for a set of interdependent NLP tasks.
Four fundamental semantic NLP tasks (Named Entity Recognition, Entity Mention Detection, Coreference Resolution and Relation Extraction) benefit from each other and thus can be combined in a single model.
The model assumes a hierarchy between the selected semantic tasks: some tasks are simpler, require less modification to the input, and thus, can be supervised at lower layers of the neural network, while other tasks are more difficult, require a more complex processing of inputs, and thus, should be supervised at higher layers of the neural network.
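
The idea of supervising simpler tasks at lower layers and harder tasks at higher layers can be sketched structurally as follows. The task-to-depth mapping, the toy layers, and all names here are assumptions made for illustration; the real model uses trained neural layers and task-specific heads.

```python
# Hypothetical assignment of tasks to encoder depths (1 = lowest layer).
TASK_DEPTH = {"ner": 1, "emd": 2, "coref": 3, "relation": 3}

def make_layer(scale):
    """Stand-in for an encoder layer: a toy transform over a list of floats."""
    return lambda hidden: [scale * h + 1.0 for h in hidden]

ENCODER = [make_layer(0.5), make_layer(2.0), make_layer(1.5)]

def encode_up_to(inputs, depth):
    """Run only the layers a task needs; its head would read the returned states."""
    hidden = inputs
    for layer in ENCODER[:depth]:
        hidden = layer(hidden)
    return hidden

def tasks_sharing_layer(layer_index):
    """Tasks whose training signal reaches the given layer (0-based index)."""
    return sorted(task for task, depth in TASK_DEPTH.items() if depth > layer_index)

for task, depth in TASK_DEPTH.items():
    print(task, encode_up_to([1.0], depth))
print(tasks_sharing_layer(0))  # the lowest layer is shared by every task
```

Because the lowest layer receives a training signal from every task while the top layer only sees the most complex ones, the lower layers are pushed toward representations that are useful across tasks.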

What’s the key achievement?

The Hierarchical Multi-Task Learning model (HMTL) beats state-of-the-art results on 3 out of 4 tasks, namely the Named Entity Recognition, Relation Extraction and Entity Mention Detection tasks.
The multi-task learning framework considerably speeds up training compared to single-task models.

What does the AI community think?

The paper will be presented at the highly selective AAAI conference in January 2019.
It is recognized by the AI community as a very impressive work.

What are future research areas?

Combining multi-task learning models with the pre-trained BERT encoder.
Searching for other settings in which multi-task learning has a significant impact.

What are possible business applications?

Businesses can leverage the advantages of this multi-task learning approach, namely high performance and high training speed, to enhance:
- performance of chatbots and voice assistants;
- finding relevant information in documents;
- analyzing customer reviews, etc.

Where can you get implementation code?

The authors provide code for this research paper on GitHub.

10. Know What You Don’t Know: Unanswerable Questions for SQuAD, by Pranav Rajpurkar, Robin Jia, and Percy Liang

Original Abstract

Extractive reading comprehension systems can often locate the correct answer to a question in a context document, but they also tend to make unreliable guesses on questions for which the correct answer is not stated in the context. Existing datasets either focus exclusively on answerable questions, or use automatically generated unanswerable questions that are easy to identify. To address these weaknesses, we present SQuAD 2.0, the latest version of the Stanford Question Answering Dataset (SQuAD). SQuAD 2.0 combines existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering. SQuAD 2.0 is a challenging natural language understanding task for existing models: a strong neural system that gets 86% F1 on SQuAD 1.1 achieves only 66% F1 on SQuAD 2.0.

Our Summary

A Stanford University research group extends the famous Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions. The answers to these questions cannot be found in the supporting paragraphs, yet the questions look very similar to the answerable ones. Moreover, the supporting paragraphs contain plausible (but incorrect) answers to these questions. This makes the new SQuAD 2.0 extremely challenging for existing state-of-the-art models: a strong neural system that achieves 86% F1 on the previous version of SQuAD gets only 66% F1 after the unanswerable questions are introduced.

What’s the core idea of this paper?

Current Natural Language Understanding (NLU) systems are far from true language understanding, and one of the root causes for this is that existing Q&A datasets focus on questions for which a correct answer is guaranteed to exist in the context document.
To be really challenging, unanswerable questions should be created so that:
- they are relevant to the supporting paragraph;
- the paragraph contains a plausible answer which contains information of the same type as what the question asks for, but is incorrect.

What’s the key achievement?

Extending SQuAD with 53,775 new, unanswerable questions, and thus building a challenging, large-scale dataset that forces NLU systems to understand when a question cannot be answered given the context.
Creating a new challenge for NLU systems by showing that existing models (66% F1) are closer to a baseline that always abstains (48.9%) than to human performance (89.5%).
Showing that plausible answers do indeed act as effective distractors for NLU systems.
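
To make the role of abstention concrete, here is a minimal scoring sketch. It assumes a simplified exact-match metric without the answer normalization a real evaluation script would apply, and the names `exact_match`, `answer_or_abstain`, and `evaluate` are hypothetical.

```python
def exact_match(prediction, gold_answers):
    """Score one question; an empty gold list marks it as unanswerable.

    An empty prediction string means the system abstained.
    """
    if not gold_answers:
        return 1.0 if prediction == "" else 0.0
    return 1.0 if prediction in gold_answers else 0.0

def answer_or_abstain(best_span, span_score, no_answer_score, threshold=0.0):
    """Abstain when the no-answer score beats the best span by more than a threshold."""
    if no_answer_score - span_score > threshold:
        return ""
    return best_span

def evaluate(predictions, dataset):
    """Average exact match over a {question_id: gold_answers} dataset."""
    scores = [exact_match(predictions[qid], gold) for qid, gold in dataset.items()]
    return sum(scores) / len(scores)

# Toy dataset: half of the questions have no answer in the paragraph.
dataset = {"q1": ["Paris"], "q2": [], "q3": ["1066"], "q4": []}
always_abstain = {qid: "" for qid in dataset}
print(evaluate(always_abstain, dataset))  # 0.5, the share of unanswerable questions
```

An always-abstain baseline scores exactly the fraction of unanswerable questions, which is why a model has to both answer correctly and abstain correctly to move far beyond it.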

What does the AI community think?

The paper received the Best Short Paper award at ACL 2018, the annual meeting of the Association for Computational Linguistics.
The new dataset raises the bar for the NLU field and can drive substantial performance gains in this research area.

What are future research areas?

Development of new models that “know what they don’t know,” and thus get a better understanding of natural language.

What are possible business applications?

Training reading comprehension models on this new dataset should improve their performance in real-world scenarios where the answers are often not directly available.

Where can you get implementation code?

The official SQuAD website has training datasets and a leaderboard comparing the top-performing models.

11. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling, by Shaojie Bai, J. Zico Kolter, Vladlen Koltun

Original Abstract

For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. To assist related work, we have made code available at http://github.com/locuslab/TCN.

Our Summary

The authors of this paper question the common assumption that recurrent architectures should be a default starting point for sequence modeling tasks. Their results suggest that generic temporal convolutional networks (TCNs) convincingly outperform canonical recurrent architectures such as long short-term memory networks (LSTMs) and gated recurrent unit networks (GRUs) across a broad range of sequence modeling tasks.

What’s the core idea of this paper?

Temporal convolutional networks (TCNs), designed using recently introduced best practices such as dilated convolutions and residual connections, significantly outperform generic recurrent architectures across a comprehensive suite of sequence modeling tasks.
TCNs exhibit substantially longer memory than recurrent architectures, and are thus more suitable for tasks where a long history is required.
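
The two building blocks named above, dilated causal convolutions and residual connections, can be sketched in plain Python. This is a simplified illustration with one convolution per block and no weight normalization or dropout; the function names are hypothetical, and the official repository should be consulted for the actual architecture.

```python
def causal_dilated_conv(sequence, kernel, dilation):
    """1-D causal convolution: output[t] depends only on inputs at t, t-d, t-2d, ...

    Positions before the start of the sequence are treated as zeros (left padding).
    """
    output = []
    for t in range(len(sequence)):
        total = 0.0
        for i, weight in enumerate(kernel):
            source = t - i * dilation
            if source >= 0:
                total += weight * sequence[source]
        output.append(total)
    return output

def residual_block(sequence, kernel, dilation):
    """Dilated causal convolution, ReLU, then a skip connection back to the input."""
    convolved = causal_dilated_conv(sequence, kernel, dilation)
    return [x + max(0.0, c) for x, c in zip(sequence, convolved)]

def receptive_field(kernel_size, dilations):
    """History length visible to one output when stacking one convolution per dilation."""
    return 1 + (kernel_size - 1) * sum(dilations)

print(causal_dilated_conv([1, 2, 3, 4], [1, 1], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
print(receptive_field(3, [1, 2, 4, 8]))  # 31
```

Doubling the dilation at each level makes the receptive field grow exponentially with depth, which is one way to read the "longer memory" claim: four small layers already see 31 time steps of history.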

What’s the key achievement?

Providing an extensive systematic comparison of convolutional and recurrent architectures on sequence modeling tasks.
Designing a convolutional architecture that can serve as a convenient and still powerful starting point for sequence modeling tasks.

What does the AI community think?

“Always start with a CNN before reaching for an RNN. You’ll be surprised with how far you can get.” – Andrej Karpathy, Director of AI at Tesla.

What are future research areas?

Further architectural and algorithmic elaborations are needed to advance TCN’s performance across different sequence modeling tasks.

What are possible business applications?

Introduction of TCNs can improve the performance of AI systems relying on recurrent architectures for sequence modeling. This includes, among others, such tasks as:
- machine translation;
- speech recognition;
- music and voice generation.

Where can you get implementation code?

As stated in the paper abstract, the researchers have provided official code via a GitHub repository.
You can also check the Keras implementation of TCN provided by Philippe Rémy.

12. Universal Language Model Fine-tuning for Text Classification, by Jeremy Howard and Sebastian Ruder

Original Abstract

Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open source our pretrained models and code.

Our Summary

Howard and Ruder suggest using pre-trained models for solving a wide range of NLP problems. With this approach, you don’t need to train your model from scratch, but only fine-tune the original model. Their method, called Universal Language Model Fine-Tuning (ULMFiT), outperforms state-of-the-art results, reducing the error by 18-24%. Moreover, with only 100 labeled examples, ULMFiT matches the performance of models trained from scratch on 10K labeled examples.

What’s the core idea of this paper?

To address the lack of labeled data and to make NLP classification easier and less time-consuming, the researchers suggest applying transfer learning to NLP problems. Thus, instead of training the model from scratch, you can use another model that has been trained to solve a similar problem as the basis, and then fine-tune the original model to solve your specific problem.
However, to be successful, this fine-tuning should take into account several important considerations:
- Different layers should be fine-tuned to different extents as they capture different kinds of information.
- Adapting the model’s parameters to task-specific features will be more efficient if the learning rate is first increased linearly and then decayed linearly.
- Fine-tuning all layers at once is likely to result in catastrophic forgetting; thus, it would be better to gradually unfreeze the model starting from the last layer.
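
The three considerations above (per-layer learning rates, a learning rate that rises and then decays linearly, and gradual unfreezing) can be sketched as simple schedules. The function names and default values below are assumptions chosen for illustration and should be treated as starting points to tune, not as the library's API.

```python
import math

def slanted_triangular_lr(step, total_steps, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate that rises linearly for a short warm-up, then decays linearly.

    Assumes 0 < cut_frac < 1; ratio is how much smaller the lowest rate is than lr_max.
    """
    cut = max(1, math.floor(total_steps * cut_frac))
    if step < cut:
        progress = step / cut
    else:
        progress = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + progress * (ratio - 1)) / ratio

def discriminative_lrs(top_lr, num_layers, decay=2.6):
    """Per-layer rates, ordered from the first (lowest) layer to the last (top) layer."""
    return [top_lr / decay ** (num_layers - 1 - layer) for layer in range(num_layers)]

def trainable_layers(epoch, num_layers):
    """Gradual unfreezing: unfreeze one more layer per epoch, starting from the last."""
    first_unfrozen = max(0, num_layers - 1 - epoch)
    return list(range(first_unfrozen, num_layers))

for step in (0, 100, 550, 1000):
    print(step, slanted_triangular_lr(step, total_steps=1000))
print(discriminative_lrs(0.01, num_layers=4))
print([trainable_layers(epoch, num_layers=4) for epoch in range(4)])
```

In a real training loop, the per-step rate from the schedule would be scaled per layer by the discriminative factors, and only the parameters of the currently unfrozen layers would be passed to the optimizer.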

What’s the key achievement?

Significantly outperforming state-of-the-art: reducing the error by 18-24%.
Much less labeled data needed: with only 100 labeled examples and 50K unlabeled, matching the performance of learning from scratch on 100x more data.

What does the AI community think?

Availability of pre-trained ImageNet models has transformed the field of computer vision. ULMFiT can be of the same importance for NLP problems.
This method can be applied to any NLP task in any language. Reports of significant improvements over the state of the art are coming in from around the world for multiple languages, including German, Polish, Hindi, Indonesian, Chinese, and Malay.

What are future research areas?

Improving language model pre-training and fine-tuning.
Applying this new method to novel tasks and models (e.g., sequence labeling, natural language generation, entailment or question answering).

What are possible business applications?

ULMFiT can more efficiently solve a wide range of NLP problems, including:
- identifying spam, bots, offensive comments;
- grouping articles by a specific feature;
- classifying positive and negative reviews;
- finding relevant documents, etc.
Potentially, this method can also help with sequence-tagging and natural language generation.

Where can you get implementation code?

Fast.ai provides an official implementation of ULMFiT for text classification as part of their fast.ai library.

13. Improving Language Understanding by Generative Pre-Training, by Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

Original Abstract

Natural language understanding comprises a wide range of diverse tasks such as textual entailment, question answering, semantic similarity assessment, and document classification. Although large unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task. In contrast to previous approaches, we make use of task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. We demonstrate the effectiveness of our approach on a wide range of benchmarks for natural language understanding. Our general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving upon the state of the art in 9 out of the 12 tasks studied. For instance, we achieve absolute improvements of 8.9% on commonsense reasoning (Stories Cloze Test), 5.7% on question answering (RACE), and 1.5% on textual entailment (MultiNLI).

Our Summary

The OpenAI team suggests significant improvements to language understanding by pre-training a language model on a diverse corpus of unlabeled text and then fine-tuning the model on each specific task using the labeled dataset. They also show that using Transformer models instead of traditional recurrent neural networks significantly improves the model’s performance. This approach outperformed the previous best results in 9 out of 12 tasks studied.

What’s the core idea of this paper?

Using a combination of unsupervised pre-training and supervised fine-tuning by learning the initial parameters of the neural network model on the unlabeled data and then adapting these parameters to the specific task using labeled data.
Avoiding extensive changes to the model architecture across tasks by using a traversal-style approach:
- The model is pre-trained on contiguous sequences of text, but tasks like question answering or textual entailment have structured inputs (e.g., ordered sentence pairs, or triplets of document, question and answer).
- The solution is to convert structured inputs into an ordered sequence that the pre-trained model can process.
Using Transformer models instead of LSTMs because they provide a more structured memory for handling long-term dependencies in the text.
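
The conversion of structured inputs into a single ordered sequence can be sketched as follows. The special-token strings and function names are hypothetical placeholders, and inputs are assumed to be pre-tokenized lists of strings.

```python
START, DELIM, EXTRACT = "<s>", "$", "<e>"  # hypothetical special-token strings

def entailment_input(premise, hypothesis):
    """One ordered sequence: premise, delimiter, hypothesis."""
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(sentence_a, sentence_b):
    """Similarity has no inherent order, so both orderings are encoded."""
    return [entailment_input(sentence_a, sentence_b),
            entailment_input(sentence_b, sentence_a)]

def multiple_choice_inputs(document, question, answers):
    """One sequence per candidate answer; each is scored independently by the model."""
    context = document + question
    return [[START] + context + [DELIM] + answer + [EXTRACT] for answer in answers]

print(entailment_input(["it", "rains"], ["the", "ground", "is", "wet"]))
print(multiple_choice_inputs(["doc"], ["why", "?"], [["a"], ["b"]]))
```

Because every task is reduced to one or more flat token sequences, the same pre-trained network can be reused unchanged, with only a small output layer added on top for each task.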

What’s the key achievement?

For the task of Natural Language Inference (NLI), outperforming the state-of-the-art methods on 4 out of 5 datasets by achieving absolute improvements of 5% on SciTail and 5.8% on QNLI.
For the tasks of question answering and commonsense reasoning, outperforming the previous best results by significant margins – up to 8.9% on Story Cloze, and 5.7% overall on RACE.
Obtaining state-of-the-art results on 2 out of 3 semantic similarity tasks by achieving absolute improvement of 4.2% on QQP.
For the classification task, obtaining the score of 45.4 on CoLA, while the previous best result was only 35.

What does the AI community think?

The work extends the ULMFiT research by using a Transformer-based model instead of an LSTM and applying the approach to a broader range of tasks.
“This is exactly where we were hoping our ULMFiT work would head – really great work from OpenAI!” – Jeremy Howard, founder of fast.ai, co-author of the ULMFiT research paper.

What are future research areas?

Further research into unsupervised learning for natural language understanding and other domains for better comprehension of when and how unsupervised learning works.

What are possible business applications?

The approach suggested by the OpenAI team enhances natural language understanding with unsupervised learning, and thus may assist with NLP applications where labeled datasets are sparse or unreliable.

Where can you get implementation code?

The OpenAI team provides code and models for this research paper on GitHub.

14. Dissecting Contextual Word Embeddings: Architecture and Representation, by Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, Wen-tau Yih

Original Abstract

Contextual word representations derived from pre-trained bidirectional language models (biLMs) have recently been shown to provide significant improvements to the state of the art for a wide range of NLP tasks. However, many questions remain as to how and why these models are so effective. In this paper, we present a detailed empirical study of how the choice of neural architecture (e.g. LSTM, CNN, or self attention) influences both end task accuracy and qualitative properties of the representations that are learned. We show there is a tradeoff between speed and accuracy, but all architectures learn high quality contextual representations that outperform word embeddings for four challenging NLP tasks. Additionally, all architectures learn representations that vary with network depth, from exclusively morphological based at the word embedding layer through local syntax based in the lower contextual layers to longer range semantics such coreference at the upper layers. Together, these results suggest that unsupervised biLMs, independent of architecture, are learning much more about the structure of language than previously appreciated.

Our Summary

In this paper, the team from the Allen Institute for Artificial Intelligence, which introduced ELMo embeddings earlier this year, seeks a better understanding of pre-trained language model representations. For this purpose, they extensively study learned word and span representations on a set of carefully designed unsupervised and supervised tasks. The findings demonstrate that learned representations, independent of architecture, vary with network depth, from exclusively morphology-based at the word embedding layer to longer-range semantics at the upper layers.

What’s the core idea of this paper?

Pre-trained language models substantially improve performance for many NLP tasks, reducing the error rate by 10-25%. However, there is still no clear understanding of why and how pre-training works in practice.
To get a better understanding of the pre-trained language model representations, the researchers empirically study how the choice of neural architecture impacts:
- direct end-task accuracies;
- qualitative properties of the representations that are learned, i.e. how contextualized word representations encode notions of syntax and semantics.

What’s the key achievement?

Confirming that there is a tradeoff between speed and accuracy. Among the three architectures evaluated – LSTM, Transformer, and Gated CNN:
- LSTMs get the highest accuracy but are also the slowest;
- the Transformer and CNN based models are 3 times faster than the LSTM based ones but are also less accurate.
Demonstrating that information captured by pre-trained bidirectional language models (biLMs) varies with the network depth:
- the word embedding layer of deep biLMs focuses exclusively on word morphology, in contrast to traditional word vectors, which also encode some semantic information at this layer;
- the lowest contextual layers of biLMs focus on local syntax;
- the upper layers can be used to induce more semantic content such as within-sentence pronominal coreferent clusters.
Showing that the biLM activations can be used to form phrase representations useful for syntactic tasks.
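
Since different layers carry different kinds of information, a downstream model can learn how much weight to give each one. The sketch below shows a softmax-weighted combination of per-layer vectors for a single token, in the style popularized by ELMo; the function name and toy values are assumptions for illustration.

```python
import math

def scalar_mix(layer_vectors, layer_logits, gamma=1.0):
    """Combine one token's per-layer vectors using softmax-normalized layer weights.

    layer_vectors[j] is the token's vector at layer j; layer_logits[j] is a learned
    scalar for that layer; gamma rescales the mixed vector for the downstream task.
    """
    peak = max(layer_logits)
    exps = [math.exp(s - peak) for s in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]

# Toy vectors for the embedding layer, a lower contextual layer, and an upper layer.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(scalar_mix(layers, [0.0, 0.0, 0.0]))   # equal weights: the plain average
print(scalar_mix(layers, [0.0, 0.0, 10.0]))  # mostly the upper layer
```

A syntactic task would be expected to learn larger weights for the lower contextual layers, while a task such as coreference resolution would lean on the upper ones.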

What does the AI community think?

The paper was presented at EMNLP 2018, a leading conference in the area of natural language processing.
“To me this really shows that pretrained language models indeed capture similar properties as computer vision models pre-trained on ImageNet.” – Sebastian Ruder, a research scientist at AYLIEN.

What are future research areas?

Exploring how much the quality of biLM representations can be improved by using very large models and datasets.
Enhancing models with explicit syntactic structure or other linguistically motivated inductive biases.
Combining the purely unsupervised biLM training objective with existing annotated resources in a multitask or semi-supervised manner.

What are possible business applications?

With a better understanding of the information captured by the pre-trained language model representations, researchers can build more sophisticated models and enhance the performance of NLP systems applied in business settings.
