Large pretrained language models are the dominant trend in recent natural language processing (NLP) research.
While many AI experts agree with Anna Rogers’s statement that getting state-of-the-art results with just more data and computing power is not research news, other NLP opinion leaders also see upsides to the current trend. For example, Sebastian Ruder, a research scientist at DeepMind, points out that these big language frameworks help us see the fundamental limitations of the current paradigm.
With transformers occupying the NLP leaderboards, it’s often hard to track which changes enabled a new big language model to set another state-of-the-art result. To help you stay up to date with the latest NLP breakthroughs, we’ve summarized research papers featuring the current leaders of the GLUE benchmark: XLNet from Carnegie Mellon University, ERNIE 2.0 from Baidu, and RoBERTa from Facebook AI.
If you’d like to skip around, here are the papers we featured:
- XLNet: Generalized Autoregressive Pretraining for Language Understanding
- ERNIE 2.0: A Continual Pre-training Framework for Language Understanding
- RoBERTa: A Robustly Optimized BERT Pretraining Approach
Big Language Frameworks
1. XLNet: Generalized Autoregressive Pretraining for Language Understanding, by Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Original Abstract
With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, XLNet outperforms BERT on 20 tasks, often by a large margin, and achieves state-of-the-art results on 18 tasks including question answering, natural language inference, sentiment analysis, and document ranking.
Our Summary
The researchers from Carnegie Mellon University and Google have developed a new model, XLNet, for natural language processing (NLP) tasks such as reading comprehension, text classification, sentiment analysis, and others. XLNet is a generalized autoregressive pretraining method that leverages the best of both autoregressive language modeling (e.g., Transformer-XL) and autoencoding (e.g., BERT) while avoiding their limitations. The experiments demonstrate that the new model outperforms both BERT and Transformer-XL and achieves state-of-the-art performance on 18 NLP tasks.
What’s the core idea of this paper?
- XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL:
- Like BERT, XLNet uses bidirectional context, which means it looks at the words before and after a given token to predict what it should be. To this end, XLNet maximizes the expected log-likelihood of a sequence with respect to all possible permutations of the factorization order (the objective is sketched after this list).
- As an autoregressive language model, XLNet doesn’t rely on data corruption, and thus avoids BERT’s limitations due to masking – i.e., the pretrain-finetune discrepancy and the assumption that the predicted (masked) tokens are independent of each other.
- To further improve architectural designs for pretraining, XLNet integrates the segment recurrence mechanism and relative encoding scheme of Transformer-XL.
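To make the permutation idea concrete, here is the pretraining objective described in the paper, written in the paper’s notation, where $\mathcal{Z}_T$ denotes the set of all permutations of the index sequence $[1, 2, \ldots, T]$ and $\mathbf{z}$ is one sampled factorization order:

$$
\max_{\theta} \;\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T} \left[ \sum_{t=1}^{T} \log p_{\theta}\!\left( x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}} \right) \right]
$$

Only the factorization order is permuted; the input sequence itself keeps its natural order, so positional encodings still correspond to the original token positions.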
What’s the key achievement?
- XLNet outperforms BERT on 20 tasks, often by a large margin.
- The new model achieves state-of-the-art performance on 18 NLP tasks including question answering, natural language inference, sentiment analysis, and document ranking.
What does the AI community think?
- The paper was accepted for oral presentation at NeurIPS 2019, one of the leading conferences in artificial intelligence.
- “The king is dead. Long live the king. BERT’s reign might be coming to an end. XLNet, a new model by people from CMU and Google outperforms BERT on 20 tasks.” – Sebastian Ruder, research scientist at DeepMind.
- “XLNet will probably be an important tool for any NLP practitioner for a while…[it is] the latest cutting-edge technique in NLP.” – Keita Kurita, Carnegie Mellon University.
What are future research areas?
- Extending XLNet to new areas, such as computer vision and reinforcement learning.
What are possible business applications?
- XLNet may assist businesses with a wide range of NLP problems, including:
- chatbots for first-line customer support or answering product inquiries;
- sentiment analysis for gauging brand awareness and perception based on customer reviews and social media;
- the search for relevant information in document bases or online, etc.
Where can you get implementation code?
- The authors have released the official TensorFlow implementation of XLNet.
- A PyTorch implementation of the model is also available on GitHub.
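For readers who want to try the model quickly, here is a minimal sketch that loads a pretrained XLNet checkpoint through the Hugging Face `transformers` library. The library, the `xlnet-base-cased` checkpoint name, and the sequence-classification head are assumptions about your setup, not part of the paper:

```python
# Minimal sketch: using a pretrained XLNet checkpoint via the Hugging Face
# `transformers` library (assumed installed; recent 4.x versions return
# output objects with a .logits attribute).
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

# "xlnet-base-cased" is the commonly published base checkpoint; adjust as needed.
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained("xlnet-base-cased", num_labels=2)

inputs = tokenizer("The product works exactly as advertised.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, num_labels)

print(logits.softmax(dim=-1))                # class probabilities (head still untrained)
```

In practice you would fine-tune the classification head on your own labeled data before reading anything into these probabilities.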
2. ERNIE 2.0: A Continual Pre-training Framework for Language Understanding, by Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, Haifeng Wang
Original Abstract
Recently, pre-trained models have achieved state-of-the-art results in various language understanding tasks, which indicates that pre-training on large-scale corpora may play a crucial role in natural language processing. Current pre-training procedures usually focus on training the model with several simple tasks to grasp the co-occurrence of words or sentences. However, besides co-occurring, there exists other valuable lexical, syntactic and semantic information in training corpora, such as named entity, semantic closeness and discourse relations. In order to extract to the fullest extent, the lexical, syntactic and semantic information from training corpora, we propose a continual pre-training framework named ERNIE 2.0 which builds and learns incrementally pre-training tasks through constant multi-task learning. Experimental results demonstrate that ERNIE 2.0 outperforms BERT and XLNet on 16 tasks including English tasks on GLUE benchmarks and several common tasks in Chinese. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.
Our Summary
Most state-of-the-art natural language processing models analyze the co-occurrence of words in sentences for pretraining. However, sentences contain additional information, such as sentence order and proximity, named entities, and semantic similarity, that these models do not capture. Researchers at Baidu tackled this problem by creating ERNIE 2.0 (Enhanced Representation through kNowledge IntEgration), a continual pretraining framework in which customized tasks are continuously introduced and trained through multi-task learning. As a result, the model can encode lexical, syntactic, and semantic information across tasks without forgetting previously trained parameters. ERNIE 2.0 outperforms BERT and XLNet on the English-language GLUE benchmark and sets a new state of the art for Chinese-language processing.
What’s the core idea of this paper?
- Existing natural language processing models mainly solve word-level and sentence-level inference tasks by leveraging co-occurrence information of words or sentences, and fail to grasp other valuable information contained in training corpora.
- To fully learn the lexical, syntactic, and semantic information contained in the text, the Baidu research team introduces ERNIE 2.0, a continual pretraining framework in which pretraining tasks are incrementally introduced and learned through multi-task learning (a schematic sketch follows this list):
- Different customized tasks can be freely introduced at any time.
- These tasks share the same encoding networks and are trained through multi-task learning.
- When a new task arrives, the framework incrementally trains the distributed representations without forgetting the previously trained parameters.
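The training loop described above can be pictured with a small, runnable toy. This is not Baidu’s PaddlePaddle implementation; the `TinyTask` class, the linear “encoder”, and the random data are illustrative placeholders for the shared encoding network and the incrementally introduced pretraining tasks:

```python
# Schematic, runnable toy of continual multi-task pretraining in the ERNIE 2.0 spirit:
# tasks are introduced one by one, and all tasks seen so far share one encoder.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyTask:
    """A placeholder task: random features, binary labels, and its own output head."""
    def __init__(self, name, feature_dim, hidden_dim):
        self.name = name
        self.head = nn.Linear(hidden_dim, 2)          # task-specific output layer
        self.x = torch.randn(64, feature_dim)         # toy "corpus"
        self.y = torch.randint(0, 2, (64,))

    def loss(self, encoder):
        logits = self.head(encoder(self.x))           # shared encoder, task-specific head
        return nn.functional.cross_entropy(logits, self.y)

encoder = nn.Linear(16, 32)                           # stand-in for the shared Transformer encoder
tasks = [TinyTask(n, 16, 32) for n in ("word_masking", "sentence_order", "discourse")]

active = []                                           # tasks introduced so far
for new_task in tasks:                                # tasks arrive incrementally
    active.append(new_task)
    params = list(encoder.parameters()) + [p for t in active for p in t.head.parameters()]
    opt = torch.optim.Adam(params, lr=1e-3)
    for step in range(50):                            # continual multi-task stage
        opt.zero_grad()
        total = sum(t.loss(encoder) for t in active)  # earlier tasks keep training too
        total.backward()
        opt.step()
    print(f"after introducing {new_task.name}: joint loss {total.item():.3f}")
```

The key design choice this illustrates is that earlier tasks stay in the training mix when a new one is added, which is how the framework avoids forgetting previously trained parameters.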
What’s the key achievement?
- According to the experiments reported in the paper, ERNIE 2.0 outperforms BERT and XLNet on the English-language GLUE benchmark:
- It gets an average score of 83.6 compared to BERT’s score of 80.5.
- It performs better than XLNet on seven out of eight individual task categories.
- ERNIE 2.0 also sets new state-of-the-art performance levels on numerous Chinese NLP tasks.
What does the AI community think?
- The ERNIE 2.0 repository is trending on GitHub.
What are future research areas?
- Introducing additional and more varied pretraining tasks into the continual pretraining framework to further improve the model’s performance.
What are possible business applications?
- Like other big pretrained language frameworks, ERNIE 2.0 may assist businesses with a wide range of NLP tasks, including chatbots, sentiment analysis, information retrieval, etc.
Where can you get implementation code?
- The source code and pretrained models used in this study are available on GitHub.
3. RoBERTa: A Robustly Optimized BERT Pretraining Approach, by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
Original Abstract
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.
Our Summary
Natural language processing models have made significant advances thanks to the introduction of pretraining methods, but the computational expense of training has made replication and careful hyperparameter tuning difficult. In this study, researchers at Facebook AI and the University of Washington analyzed the training of Google’s Bidirectional Encoder Representations from Transformers (BERT) model and identified several changes to the training procedure that enhance its performance. Specifically, the researchers used a new, larger dataset for training, trained the model over far more iterations, and removed the next sentence prediction training objective. The resulting optimized model, RoBERTa (Robustly Optimized BERT Approach), matched the scores of the recently introduced XLNet model on the GLUE benchmark.
What’s the core idea of this paper?
- The Facebook AI research team found that BERT was significantly undertrained and suggested an improved recipe for its training, called RoBERTa:
- More data: 160GB of text instead of the 16GB dataset originally used to train BERT.
- Longer training: increasing the number of iterations from 100K to 300K and then further to 500K.
- Larger batches: 8K instead of 256 in the original BERT base model.
- Larger byte-level BPE vocabulary with 50K subword units instead of character-level BPE vocabulary of size 30K.
- Removing the next sentence prediction objective from the training procedure.
- Dynamically changing the masking pattern applied to the training data (a minimal illustration follows this list).
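To illustrate the last point, the snippet below sketches the difference between static masking (a pattern fixed once during preprocessing) and dynamic masking (a fresh pattern sampled every time a sequence is fed to the model). It is a simplified illustration, not the fairseq implementation: it only replaces tokens with the mask id at roughly BERT’s 15% rate and omits the 80/10/10 replacement scheme.

```python
import random

MASK_RATE = 0.15   # BERT/RoBERTa-style masking rate (simplified)

def dynamic_mask(token_ids, mask_id):
    """Sample a fresh masking pattern each time the sequence is used.

    With static masking the pattern is chosen once during preprocessing, so the
    model sees the same masked positions in every epoch; dynamic masking
    re-draws the positions on every pass over the data.
    """
    masked = list(token_ids)
    for i in range(len(masked)):
        if random.random() < MASK_RATE:
            masked[i] = mask_id
    return masked

# Calling the function twice on the same sequence yields different patterns:
sequence = [101, 2023, 2003, 1037, 7099, 6251, 102]   # illustrative token ids
print(dynamic_mask(sequence, mask_id=103))
print(dynamic_mask(sequence, mask_id=103))
```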
What’s the key achievement?
- RoBERTa outperforms BERT in all individual tasks on the General Language Understanding Evaluation (GLUE) benchmark.
- The new model matches the recently introduced XLNet model on the GLUE benchmark and sets a new state of the art in four out of nine individual tasks.
What are future research areas?
- Incorporating more sophisticated multi-task finetuning procedures.
What are possible business applications?
- Big pretrained language frameworks like RoBERTa can be leveraged in the business setting for a wide range of downstream tasks, including dialogue systems, question answering, document classification, etc.
Where can you get implementation code?
- The models and code used in this study are available on GitHub.
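As a quick way to poke at a pretrained checkpoint, the sketch below queries RoBERTa’s masked-language-model head through the Hugging Face `transformers` fill-mask pipeline with the publicly released `roberta-base` weights. Using `transformers` here is our choice for illustration; the authors’ own code lives in the fairseq repository linked above.

```python
# Minimal sketch: fill-mask predictions from a pretrained RoBERTa checkpoint via
# the Hugging Face `transformers` library (assumed installed); not the fairseq code.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa uses "<mask>" as its mask token.
for prediction in fill_mask("The goal of pretraining is to learn good <mask> representations."):
    print(f'{prediction["token_str"]!r:15} score={prediction["score"]:.3f}')
```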