Practical applications of Natural Language Processing (NLP) have gotten significantly cheaper, faster, and easier due to the transfer learning capabilities enabled by pre-trained language models. Transfer learning enables engineers to pre-train an NLP model on one large dataset and then quickly fine-tune the model to adapt to other NLP tasks.
This new approach enables NLP models to learn both lower-level and higher-level features of language, leading to much better model performance for virtually all standard NLP tasks and a new standard for industry best practices.
To help you quickly understand the significance of this technical achievement and how it accelerates your own work in NLP, we’ve summarized the key lessons you should know in easy-to-read bullet-point format. We’ve also included summaries of the 3 most important research papers in the space that you need to be aware of.
If these accessible AI research analyses and summaries are useful for you, you can subscribe to receive our regular industry updates.
How Do Pre-Trained Language Models Accelerate Natural Language Processing (NLP) Applications?
What’s the significance of pre-trained language models for the field of NLP?
- Instead of training the model from scratch, you can use another pre-trained model as the basis and only fine-tune it to solve the specific NLP task.
- Using pre-trained models allows you to achieve the same or even better performance much faster and with much less labeled data.
What are the three big achievements in pre-trained language models? What are their main takeaways?
- Using pre-trained language models is one of the most exciting directions for NLP today, and lots of papers introduced recently explore transfer learning. However, here we want to highlight three research papers that are at the core of this new trend in NLP:
- ULMFiT, or the Universal Language Model Fine-Tuning method, is arguably the first effective approach to fine-tuning a language model for downstream tasks. The authors demonstrate the importance of several novel techniques, including discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing, for retaining previous knowledge and avoiding catastrophic forgetting during fine-tuning.
- ELMo word representations, or Embeddings from Language Models, are generated in a way to take the entire context into consideration. In particular, they are created as a weighted sum of the internal states of a deep bi-directional language model (biLM), pre-trained on a large text corpus. Furthermore, ELMo representations are based on characters so that the network can understand even out-of-vocabulary tokens unseen in training.
- BERT, or Bidirectional Encoder Representations from Transformers, is a new cutting-edge model that considers the context from both the left and the right sides of each word. The two key success factors are (1) masking part of input tokens to avoid cycles where words indirectly “see themselves”, and (2) pre-training a sentence relationship model. Finally, BERT is also a very big model trained on a huge word corpus.
How do they make your job easier? How can you use them?
- You can use pre-trained language models instead of training your model from scratch and achieve better performance much faster and with less training data:
- The fastai library provides the modules necessary to train and use ULMFiT models. A model pre-trained on WikiText-103 is also available.
- The Allen Institute for Artificial Intelligence provides pre-trained ELMo models in English and Portuguese. You can also retrain models using the released TensorFlow code.
- You are also free to use the pre-trained BERT models released by the Google Research team.
What’s been built on top of them since?
- The ULMFiT work was further extended and applied in the following papers:
- Improving Language Understanding by Generative Pre-Training by Radford et al.;
- Universal Language Model Fine-Tuning with Subword Tokenization for Polish by Czapla, Howard, and Kardas;
- Universal Language Model Fine-tuning for Patent Classification by Jason Hepburn.
- ELMo embeddings have already been used in a number of important research papers, including:
- Linguistically-Informed Self-Attention for Semantic Role Labeling by Strubell et al.;
- Language Model Pre-training for Hierarchical Document Representations by Chang et al.;
- Deep Enhanced Representation for Implicit Discourse Relation Recognition by Bai and Zhao.
- BERT was introduced only in late 2018 but has already served as a basis for further research advancements:
- A BERT Baseline for the Natural Questions by Alberti, Lee, and Collins;
- SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering by Zhu et al.;
- Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers by Wang et al.
Detailed Summaries of Relevant NLP Language Model Research Papers
If you’d like to dive deeper into the details, we’ve also summarized the key research papers that were published on this topic in the last year.
- Universal Language Model Fine-tuning for Text Classification
- Deep contextualized word representations
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
1. Universal Language Model Fine-tuning for Text Classification, by Jeremy Howard and Sebastian Ruder
Original Abstract
Inductive transfer learning has greatly impacted computer vision, but existing approaches in NLP still require task-specific modifications and training from scratch. We propose Universal Language Model Fine-tuning (ULMFiT), an effective transfer learning method that can be applied to any task in NLP, and introduce techniques that are key for fine-tuning a language model. Our method significantly outperforms the state-of-the-art on six text classification tasks, reducing the error by 18-24% on the majority of datasets. Furthermore, with only 100 labeled examples, it matches the performance of training from scratch on 100x more data. We open source our pretrained models and code.
Our summary
Howard and Ruder suggest using pre-trained models for solving a wide range of NLP problems. With this approach, you don’t need to train your model from scratch but only fine-tune the original model. Their method, called Universal Language Model Fine-tuning (ULMFiT), outperforms the state of the art, reducing the error by 18-24%. What’s more, with only 100 labeled examples, ULMFiT matches the performance of models trained from scratch on 100x more (10K) labeled examples.
What’s the core idea of this paper?
- To address the lack of labeled data and to make NLP classification easier and less time-consuming, the researchers suggest applying transfer learning to NLP problems. Thus, instead of training a model from scratch, you can use a model that has already been trained on a related problem as the basis, and then fine-tune it to solve your specific problem.
- However, to be successful, this fine-tuning should take into account several important considerations:
- Different layers should be fine-tuned to different extents as they capture different kinds of information.
- Adapting the model’s parameters to task-specific features is more efficient if the learning rate is first increased linearly and then decayed linearly, i.e. the slanted triangular learning rate schedule (see the code sketch after this list).
- Fine-tuning all layers at once is likely to result in catastrophic forgetting; thus, it would be better to gradually unfreeze the model starting from the last layer.
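The schedule referenced above is simple enough to write down directly. Below is a minimal sketch of the slanted triangular learning rate from the ULMFiT paper, with the paper’s default hyperparameters (cut_frac=0.1, ratio=32); the function name and the example values are ours, not the authors’ code.

```python
import math

def slanted_triangular_lr(t, T, eta_max=0.01, cut_frac=0.1, ratio=32):
    """Learning rate at step t out of T total steps: a short linear warm-up
    (the first cut_frac of training) followed by a long linear decay."""
    cut = math.floor(T * cut_frac)                      # step at which the rate peaks
    if t < cut:
        p = t / cut                                     # increasing phase
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))  # decaying phase
    return eta_max * (1 + p * (ratio - 1)) / ratio

# Inspect the schedule over 1,000 fine-tuning steps
schedule = [slanted_triangular_lr(t, T=1000) for t in range(1000)]
print(max(schedule), schedule[-1])  # peaks at eta_max, decays toward eta_max / ratio
```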
What’s the key achievement?
- Significantly outperforming state-of-the-art: reducing the error by 18-24%.
- Much less labeled data needed: with only 100 labeled examples and 50K unlabeled, matching the performance of learning from scratch on 100x more data.
What does the AI community think?
- The availability of pre-trained ImageNet models has transformed the field of computer vision; ULMFiT could be of the same importance for NLP problems.
- The method can be applied to any NLP task in any language. Reports of significant improvements over the state of the art are coming in from around the world for multiple languages, including German, Polish, Hindi, Indonesian, Chinese, and Malay.
What are future research areas?
- Improving language model pre-training and fine-tuning.
- Applying this new method to novel tasks and models (e.g., sequence labeling, natural language generation, entailment or question answering).
What are possible business applications?
- ULMFiT can more efficiently solve a wide range of NLP problems, including:
- identifying spam, bots, offensive comments;
- grouping articles by a specific feature;
- classifying positive and negative reviews;
- finding relevant documents etc.
- Potentially, this method can also help with sequence-tagging and natural language generation.
Where can you get implementation code?
- Fast.ai provides an official implementation of ULMFiT for text classification as part of their fast.ai library.
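For reference, a typical ULMFiT run with the fastai (v1) text API looks roughly like the sketch below. The CSV file, column layout, and hyperparameters are placeholders; check the fastai text tutorial for the exact API of the version you install.

```python
from fastai.text import (TextLMDataBunch, TextClasDataBunch, AWD_LSTM,
                         language_model_learner, text_classifier_learner)

path = 'data/'  # hypothetical folder containing reviews.csv (label, text columns)

# 1. Fine-tune the WikiText-103 pre-trained language model on the target corpus
data_lm = TextLMDataBunch.from_csv(path, 'reviews.csv')
learn_lm = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)
learn_lm.fit_one_cycle(1, 1e-2)
learn_lm.save_encoder('ft_enc')

# 2. Train the classifier on top of the fine-tuned encoder,
#    with gradual unfreezing and discriminative learning rates
data_clas = TextClasDataBunch.from_csv(path, 'reviews.csv', vocab=data_lm.train_ds.vocab)
learn_clf = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn_clf.load_encoder('ft_enc')
learn_clf.fit_one_cycle(1, 2e-2)                            # last layer group only
learn_clf.freeze_to(-2)
learn_clf.fit_one_cycle(1, slice(1e-2 / (2.6 ** 4), 1e-2))  # unfreeze one more group
learn_clf.unfreeze()
learn_clf.fit_one_cycle(2, slice(1e-3 / (2.6 ** 4), 1e-3))  # fine-tune everything
```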
2. Deep contextualized word representations, by Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, Luke Zettlemoyer
Original Abstract
We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.
Our summary
The team from the Allen Institute for Artificial Intelligence introduces a new type of deep contextualized word representation: Embeddings from Language Models (ELMo). In ELMo-enhanced models, each word is vectorized on the basis of the entire context in which it is used. Adding ELMo to existing NLP systems results in (1) relative error reductions ranging from 6% to 20%, (2) a significantly lower number of epochs required to train the models, and (3) a significantly reduced amount of training data needed to reach baseline performance.
What’s the core idea of this paper?
- To generate word embeddings as a weighted sum of the internal states of a deep bidirectional language model (biLM) pre-trained on a large text corpus (see the code sketch after this list).
- To include representations from all layers of a biLM as different layers represent different types of information.
- To base ELMo representations on characters so that the network can use morphological clues to “understand” out-of-vocabulary tokens unseen in training.
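To make the “weighted sum of internal states” concrete, here is a minimal PyTorch sketch of the layer-mixing step (the class name and the random tensors are illustrative; in practice the layer outputs come from a pre-trained biLM, and AllenNLP ships its own version of this module).

```python
import torch
import torch.nn as nn

class ElmoLayerMix(nn.Module):
    """Task-specific weighted sum of biLM layer outputs:
    ELMo_k = gamma * sum_j softmax(s)_j * h_{k,j}."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalar_weights = nn.Parameter(torch.zeros(num_layers))  # s_j, learned with the downstream task
        self.gamma = nn.Parameter(torch.ones(1))                     # task-specific scaling

    def forward(self, layer_outputs):
        # layer_outputs: one (batch, seq_len, dim) tensor per biLM layer
        weights = torch.softmax(self.scalar_weights, dim=0)
        mixed = sum(w * h for w, h in zip(weights, layer_outputs))
        return self.gamma * mixed

# Stand-ins for the three layers of a pre-trained biLM (batch=2, seq_len=7, dim=1024)
layers = [torch.randn(2, 7, 1024) for _ in range(3)]
elmo_embeddings = ElmoLayerMix(num_layers=3)(layers)  # shape (2, 7, 1024)
```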
What’s the key achievement?
- Adding ELMo to a model leads to new state-of-the-art results, with relative error reductions ranging from 6% to 20% across such NLP tasks as question answering, textual entailment, semantic role labeling, coreference resolution, named entity extraction, and sentiment analysis.
- Enhancing a model with ELMo results in a significantly lower number of updates required to reach state-of-the-art performance. For example, the Semantic Role Labeling (SRL) model with ELMo needs only 10 epochs to exceed the baseline maximum reached after 486 epochs of training.
- Introducing ELMo to the model also significantly reduces the amount of training data needed to achieve the same level of performance. For example, for the SRL task, the ELMo-enhanced model needs only 1% of the training set to achieve the same performance as the baseline model with 10% of the training data.
What does the AI community think?
- The paper received an Outstanding Paper award at NAACL, one of the most influential NLP conferences in the world.
- The ELMo method introduced in the paper is considered one of the greatest breakthroughs of 2018 and a staple of NLP for years to come.
What are future research areas?
- Incorporating this method into specific tasks by concatenating ELMos with context-independent word embeddings.
- Experimenting with concatenating ELMos with the output as well.
What are possible business applications?
- ELMo significantly improves the performance of existing NLP systems and thus enhances:
- chatbots, which will be better at understanding humans and answering questions;
- classification of positive and negative customer reviews;
- the search for relevant information and documents, etc.
Where can you get implementation code?
- The Allen Institute provides pre-trained ELMo models in English and Portuguese. You can also retrain models using TensorFlow code.
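As a quick orientation, embedding a batch of sentences with the pre-trained weights via AllenNLP (the 0.x releases, which ship the original ELMo modules) looks roughly like this; the local file names stand in for the published options/weights files and are placeholders.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "elmo_options.json"   # placeholder path to the published options file
weight_file = "elmo_weights.hdf5"    # placeholder path to the published weights file

elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "bank", "raised", "interest", "rates", "."],
             ["She", "sat", "on", "the", "river", "bank", "."]]
character_ids = batch_to_ids(sentences)          # character-based, so no fixed vocabulary
output = elmo(character_ids)
embeddings = output["elmo_representations"][0]   # (batch, max_seq_len, 1024), context-dependent:
                                                 # the two occurrences of "bank" get different vectors
```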
3. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova
Original Abstract
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.
BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.
Our summary
A Google AI team presents a new cutting-edge model for Natural Language Processing (NLP) – BERT, or Bidirectional Encoder Representations from Transformers. Its design allows the model to consider the context from both the left and the right sides of each word. While being conceptually simple, BERT obtains new state-of-the-art results on eleven NLP tasks, including question answering, named entity recognition and other tasks related to general language understanding.
What’s the core idea of this paper?
- Training a deep bidirectional model by randomly masking a percentage of the input tokens, thus avoiding cycles in which words can indirectly “see themselves” (the masking step is sketched in code after this list).
- Also pre-training a sentence relationship model by building a simple binary classification task to predict whether sentence B immediately follows sentence A, thus allowing BERT to better understand relationships between sentences.
- Training a very big model (24 Transformer blocks, 1024-hidden, 340M parameters) with lots of data (3.3 billion word corpus).
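To illustrate the first of these ideas, here is a toy sketch of the masked language model corruption scheme described in the paper (select roughly 15% of tokens as prediction targets; replace 80% of them with [MASK], 10% with a random token, and leave 10% unchanged). The function and the tiny vocabulary are ours, not the authors’ code.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # stand-in for a real vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Return a corrupted copy of `tokens` plus {position: original token} targets."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok                          # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK                   # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(TOY_VOCAB)  # 10%: replace with a random token
            # else: 10%: keep the original token
    return corrupted, targets

corrupted, targets = mask_tokens("the cat sat on the mat".split(), seed=0)
print(corrupted, targets)
```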
What’s the key achievement?
- Advancing the state-of-the-art for 11 NLP tasks, including:
- achieving a GLUE score of 80.4%, a 7.6% absolute improvement over the previous best result;
- reaching 93.2% F1 on SQuAD v1.1 and outperforming human performance by 2%.
- Suggesting a pre-trained model, which doesn’t require any substantial architecture modifications to be applied to specific NLP tasks.
What does the AI community think?
- The BERT model marks a new era in NLP.
- In a nutshell, two unsupervised tasks together (“fill in the blank” and “does sentence B come after sentence A?”) provide great results for many NLP tasks.
- Pre-training of language models becomes a new standard.
What are future research areas?
- Testing the method on a wider range of tasks.
- Investigating the linguistic phenomena that may or may not be captured by BERT.
What are possible business applications?
- BERT may assist businesses with a wide range of NLP problems, including:
- chatbots for better customer experience;
- analysis of customer reviews;
- the search for relevant information, etc.
Where can you get implementation code?
- Google Research has released an official GitHub repository with TensorFlow code and pre-trained models for BERT.
- A PyTorch implementation of BERT is also available on GitHub (a short usage sketch follows).
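With the PyTorch port (published at the time as pytorch-pretrained-bert and later folded into Hugging Face Transformers, which renamed parts of the API), extracting contextual features from a pre-trained checkpoint looks roughly like the sketch below; it targets the original 0.x API.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

text = "[CLS] pre-trained language models transfer well [SEP]"
tokens = tokenizer.tokenize(text)
indexed = tokenizer.convert_tokens_to_ids(tokens)
tokens_tensor = torch.tensor([indexed])

with torch.no_grad():
    # returns one hidden-state tensor per Transformer layer plus the pooled [CLS] output
    encoded_layers, pooled_output = model(tokens_tensor)

print(len(encoded_layers), encoded_layers[-1].shape)  # 12 layers, each (1, seq_len, 768)
```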
Did you enjoy this AI research analysis and summary? You can read our longer summary of 14 top Natural Language Processing (NLP) research papers from 2018.