NLP research advances in 2020 are still dominated by large pre-trained language models, and specifically transformers. Many interesting updates introduced this year have made the transformer architecture more efficient and applicable to long documents.
Another hot topic relates to the evaluation of NLP models in different applications. We still lack evaluation approaches that clearly show where a model fails and how to fix it.
Also, with the growing capabilities of language models such as GPT-3, conversational AI is enjoying a new wave of interest. Chatbots are improving, with several impressive bots like Meena and Blender introduced this year by top technology companies.
To help you stay up to date with the latest NLP research breakthroughs, we’ve curated and summarized the key research papers in natural language processing from 2020. The papers cover the leading language models, updates to the transformer architecture, novel evaluation approaches, and major advances in conversational AI.
If you’d like to skip around, here are the papers we featured:
- WinoGrande: An Adversarial Winograd Schema Challenge at Scale
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Reformer: The Efficient Transformer
- Longformer: The Long-Document Transformer
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
- Language Models are Few-Shot Learners
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
- Towards a Human-like Open-Domain Chatbot
- Recipes for Building an Open-Domain Chatbot
Best NLP Research Papers 2020
1. WinoGrande: An Adversarial Winograd Schema Challenge at Scale, by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi
Original Abstract
The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense.
To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed.
Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.
Our Summary
The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande, a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm. The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.
What’s the core idea of this paper?
- The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense reasoning.
- They introduce WinoGrande, a new large-scale dataset for commonsense reasoning. Their approach has two key features:
- A carefully designed crowdsourcing procedure:
- Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers.
- Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected questions, 53K were deemed valid.
- A novel algorithm AfLite for systematic bias reduction:
- It generalizes human-detectable biases based on word associations to machine-detectable biases based on embedding associations (a simplified sketch of this filtering loop follows after this list).
- After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples.
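To make the bias-reduction step more concrete, below is a minimal sketch of an AfLite-style adversarial filtering loop, assuming precomputed contextual embeddings and gold labels as NumPy arrays. The function name, hyperparameter values, and use of scikit-learn classifiers are illustrative assumptions, not the authors’ released implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(embeddings, labels, n_iters=10, n_models=64,
                  train_frac=0.5, cutoff=0.75, max_remove=500):
    """Iteratively drop instances that simple linear classifiers over
    precomputed embeddings predict too easily (illustrative sketch)."""
    keep = np.arange(len(labels))
    for _ in range(n_iters):
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_models):
            # Random train/eval split; only held-out instances are scored.
            perm = np.random.permutation(len(keep))
            split = int(train_frac * len(keep))
            tr, ev = perm[:split], perm[split:]
            clf = LogisticRegression(max_iter=1000)
            clf.fit(embeddings[keep[tr]], labels[keep[tr]])
            preds = clf.predict(embeddings[keep[ev]])
            correct[ev] += (preds == labels[keep[ev]])
            counts[ev] += 1
        predictability = correct / np.maximum(counts, 1)
        # Remove the most predictable instances above the cutoff.
        ranked = np.argsort(-predictability)
        drop = [i for i in ranked[:max_remove] if predictability[i] > cutoff]
        if not drop:
            break
        keep = np.delete(keep, drop)
    return keep  # indices of the debiased subset
```

The intuition is the same as in the paper: examples that remain easy for shallow classifiers operating only on embeddings likely contain spurious cues and are filtered out.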
What’s the key achievement?
- WinoGrande is easy for humans and challenging for machines:
- Wino Knowledge Hunting (WKH) and Ensemble LMs only achieve chance-level performance (50%);
- RoBERTa achieves 79.1% test-set accuracy;
- whereas human performance achieves 94% accuracy.
- WinoGrande is also an effective resource for transfer learning. The RoBERTa-based model fine-tuned on WinoGrande achieved a new state of the art on WSC and four other related datasets:
- 90.1% on WSC;
- 93.1% on DPR;
- 90.6% on COPA;
- 85.6% on KnowRef; and
- 97.1% on Winogender.
What does the AI community think?
- The paper received the Outstanding Paper Award at AAAI 2020, one of the key conferences in artificial intelligence.
What are future research areas?
- Exploring new algorithmic approaches for systematic bias reduction.
- Debiasing other NLP benchmarks.
Where can you get implementation code?
- The dataset can be downloaded from the WinoGrande project page.
- The implementation code is available on GitHub.
- And here is the WinoGrande leaderboard.
2. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Original Abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.
Our Summary
The Google research team suggests a unified approach to transfer learning in NLP with the goal to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
What’s the core idea of this paper?
- The paper has several important contributions:
- Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques.
- Introducing a new approach to transfer learning in NLP by suggesting to treat every NLP problem as a text-to-text task:
- The model knows which task it should perform thanks to a task-specific prefix added to the original input sequence (e.g., “translate English to German:”, “summarize:”); see the usage sketch after this list.
- Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the Colossal Clean Crawled Corpus (C4).
- Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
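The text-to-text framing is easy to see in code. Below is a small usage sketch based on the Hugging Face Transformers port of T5, which is a third-party library rather than the paper’s own codebase; the checkpoint name and task prefixes follow that library’s conventions.

```python
# pip install transformers sentencepiece
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is plain text in, plain text out; the prefix tells T5 what to do.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: Owning a dog is good for you. Dogs make people happier and more active.",
    "cola sentence: The course is jumping well.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The same weights, loss, and decoding procedure handle translation, summarization, and classification: only the input prefix changes.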
What’s the key achievement?
- The T5 model with 11 billion parameters achieved state-of-the-art performance on 17 out of 24 tasks considered, including:
- the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks;
- the Exact Match score of 90.06 on the SQuAD dataset;
- the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
- the ROUGE-2-F score of 21.55 on the CNN/Daily Mail abstractive summarization task.
What are future research areas?
- Researching the methods to achieve stronger performance with cheaper models.
- Exploring more efficient knowledge extraction techniques.
- Further investigating the language-agnostic models.
What are possible business applications?
- Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.
Where can you get implementation code?
- The pretrained models together with the dataset and code are released on GitHub.
3. Reformer: The Efficient Transformer, by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
Original Abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
Our Summary
The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest (1) using reversible layers to allow storing the activations only once instead of for each layer, and (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.
What’s the core idea of this paper?
- The leading Transformer models require huge computational resources because of the very high number of parameters and several other factors:
- The activations of every layer need to be stored for back-propagation.
- The intermediate feed-forward layers account for a large fraction of memory use since their depth is often much larger than the depth of attention activations.
- The complexity of attention on a sequence of length L is O(L²).
- To address these problems, the research team introduces the Reformer model with the following improvements:
- using reversible layers to store only a single copy of activations;
- splitting activations inside the feed-forward layers and processing them in chunks;
- approximating attention computation based on locality-sensitive hashing (sketched below).
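As an illustration of the locality-sensitive hashing idea, the toy NumPy sketch below implements the angular LSH scheme described in the paper: each shared query/key vector is multiplied by a random projection matrix and assigned to the bucket given by the argmax over the concatenated positive and negative projections. Shapes and parameter values are illustrative, and this is not the authors’ JAX/Trax implementation.

```python
import numpy as np

def lsh_buckets(vectors, n_buckets, seed=0):
    """Angular LSH: nearby vectors (high cosine similarity) tend to fall
    into the same bucket, so attention can be restricted to each bucket."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[-1]
    projections = vectors @ rng.standard_normal((d, n_buckets // 2))
    # Bucket id = argmax over [xR, -xR], as in the Reformer hashing scheme.
    return np.argmax(np.concatenate([projections, -projections], axis=-1), axis=-1)

# Tokens sharing a bucket are then sorted and chunked, and attention is
# computed only within (neighboring) chunks, cutting cost from O(L^2)
# toward O(L log L).
qk = np.random.randn(1024, 64)           # shared query/key vectors for 1,024 tokens
buckets = lsh_buckets(qk, n_buckets=32)
print(buckets.shape, buckets[:10])
```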
What’s the key achievement?
- By analyzing the introduced techniques one by one, the authors show that model accuracy is not sacrificed by:
- switching to locality-sensitive hashing attention;
- using reversible layers.
- Reformer performs on par with the full Transformer model while demonstrating much higher speed and memory efficiency:
- For example, on the newstest2014 task for machine translation from English to German, the Reformer base model gets a BLEU score of 27.6, compared to the 27.3 BLEU score of Vaswani et al. (2017).
What does the AI community think?
- The paper was selected for oral presentation at ICLR 2020, the leading conference in deep learning.
What are possible business applications?
- The suggested efficiency improvements enable more widespread Transformer application, especially for the tasks that depend on large-context data, such as:
- text generation;
- visual content generation;
- music generation;
- time-series forecasting.
Where can you get implementation code?
- The official code implementation from Google is publicly available on GitHub.
- The PyTorch implementation of Reformer is also available on GitHub.
4. Longformer: The Long-Document Transformer, by Iz Beltagy, Matthew E. Peters, Arman Cohan
Original Abstract
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.
Our Summary
Self-attention is one of the key factors behind the success of Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. The research team from the Allen Institute for Artificial Intelligence introduces a more elegant solution to this problem. The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention. This attention mechanism scales linearly with the sequence length and enables processing of documents with thousands of tokens. The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when pre-trained, consistently outperforms RoBERTa on long-document tasks.
What’s the core idea of this paper?
- The computational requirements of self-attention grow quadratically with sequence length, making long sequences hard to process on current hardware.
- To address this issue, the researchers present Longformer, a modified version of Transformer architecture that:
- allows memory usage to scale linearly, and not quadratically, with the sequence length;
- includes an attention mechanism that combines:
- a windowed local-context self-attention to build contextual representations;
- an end task motivated global attention to encode inductive bias about the task and build full sequence representation.
- Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in existing deep learning libraries like PyTorch and TensorFlow, the authors also introduce a custom CUDA kernel for implementing these attention operations. A usage example with the released pretrained model follows below.
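For readers who want to try the pretrained model, here is a minimal usage sketch based on the Hugging Face Transformers port of Longformer, not the paper’s own code; the checkpoint name and the global_attention_mask convention come from that library.

```python
# pip install transformers
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_text = " ".join(["Long documents need long-range context."] * 400)
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)

# Sliding-window (local) attention everywhere; global attention only where the
# mask is 1 (here, the first token, as one might do for classification).
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

outputs = model(**inputs, global_attention_mask=global_attention_mask)
print(outputs.last_hidden_state.shape)   # (1, sequence_length, hidden_size)
```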
What’s the key achievement?
- The Longformer model achieves a new state of the art on character-level language modeling tasks:
- BPC of 1.10 on text8;
- BPC of 1.00 on enwik8.
- After pre-training and fine-tuning on six tasks, including classification, question answering, and coreference resolution, the Longformer-base consistently outperforms the RoBERTa-base with:
- accuracy of 75.0 vs. 72.4 on WikiHop;
- F1 score of 75.2 vs. 74.2 on TriviaQA;
- joint F1 score of 64.4 vs. 63.5 on HotpotQA;
- average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
- accuracy of 95.7 vs. 95.3 on the IMDB classification task;
- F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.
- The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).
What are future research areas?
- Exploring other attention patterns that are more efficient due to dynamic adaptation to the input.
- Applying Longformer to other relevant long document tasks such as summarization.
What are possible business applications?
- The Longformer architecture can be very advantageous for the downstream NLP tasks that often require processing of long documents:
- document classification;
- question answering;
- coreference resolution;
- summarization;
- semantic search.
Where can you get implementation code?
- The code implementation of Longformer is open-sourced on GitHub.
5. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
Original Abstract
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
Our Summary
The pre-training task for popular language models like BERT and XLNet involves masking a small subset of unlabeled input and then training the network to recover this original input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%). As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection. Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement. As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
What’s the core idea of this paper?
- Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning.
- Researchers propose a new pre-training task called replaced token detection, where:
- some tokens are replaced by samples from a small generator network;
- a model is pre-trained as a discriminator to distinguish between original and replaced tokens.
- The introduced approach, called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately):
- enables the model to learn from all input tokens instead of the small masked-out subset;
- is not adversarial, despite the similarity to GAN, as the generator producing tokens for replacement is trained with maximum likelihood (a sketch of the combined training loss follows below).
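A simplified sketch of how the two pre-training losses are combined is shown below. This is illustrative PyTorch code rather than the released TensorFlow implementation; the tensor names are assumptions, while the heavier weighting of the discriminator loss (roughly 50) follows the paper.

```python
import torch
import torch.nn.functional as F

def electra_losses(gen_logits, disc_logits, input_ids, corrupted_ids,
                   mlm_mask, disc_weight=50.0):
    """Combined ELECTRA-style pre-training loss (illustrative sketch).

    gen_logits:    (batch, seq, vocab) generator predictions
    disc_logits:   (batch, seq)        discriminator "replaced?" scores
    input_ids:     (batch, seq)        original tokens
    corrupted_ids: (batch, seq)        tokens after sampling generator replacements
    mlm_mask:      (batch, seq) bool   positions masked out for the generator
    """
    # Generator: ordinary masked-language-modeling loss on masked positions only.
    gen_loss = F.cross_entropy(gen_logits[mlm_mask], input_ids[mlm_mask])

    # Discriminator: binary "replaced vs. original" label for *every* token,
    # which is why the model learns from all input positions.
    is_replaced = (corrupted_ids != input_ids).float()
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    return gen_loss + disc_weight * disc_loss
```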
What’s the key achievement?
- Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning.
- Introducing a model that substantially outperforms state-of-the-art approaches while requiring less pre-training compute:
- ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8.
- An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute.
- ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.
What does the AI community think?
- The paper was selected for presentation at ICLR 2020, the leading conference in deep learning.
What are possible business applications?
- Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.
Where can you get implementation code?
- The original TensorFlow implementation and pre-trained weights are released on GitHub.
6. Language Models are Few-Shot Learners, by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Original Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Our Summary
The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. They test their solution by training a 175B-parameter autoregressive language model, called GPT-3, and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.
What’s the core idea of this paper?
- The GPT-3 model uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization.
- However, in contrast to GPT-2, it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, as in the Sparse Transformer.
- The model is evaluated in three different settings:
- Few-shot learning, when the model is given a few demonstrations of the task (typically, 10 to 100) at inference time but with no weight updates allowed (a prompt-construction sketch follows after this list).
- One-shot learning, when only one demonstration is allowed, together with a natural language description of the task.
- Zero-shot learning, when no demonstrations are allowed and the model has access only to a natural language description of the task.
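Few-shot “in-context learning” is purely a prompting convention: the demonstrations are concatenated into the input text and no weights are updated. The sketch below shows one way such a prompt might be assembled; the Q/A template is an illustrative assumption rather than a format prescribed by the paper, while the translation examples echo the paper’s figures.

```python
def build_few_shot_prompt(task_description, demonstrations, query):
    """Concatenate a task description, K solved examples, and a new query
    into one text prompt; the language model is asked to complete the answer."""
    lines = [task_description, ""]
    for source, target in demonstrations:
        lines += [f"Q: {source}", f"A: {target}", ""]
    lines += [f"Q: {query}", "A:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "plush giraffe",
)
print(prompt)   # this text would be sent to the model as-is, with no fine-tuning
```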
What’s the key achievement?
- The GPT-3 model without fine-tuning achieves promising results on a number of NLP tasks, and even occasionally surpasses state-of-the-art models that were fine-tuned for that specific task:
- On the CoQA benchmark, 81.5 F1 in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting, compared to the 90.7 F1 score achieved by fine-tuned SOTA.
- On the TriviaQA benchmark, 64.3% accuracy in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, surpassing the state of the art (68%) by 3.2%.
- On the LAMBADA dataset, 76.2% accuracy in the zero-shot setting, 72.5% in the one-shot setting, and 86.4% in the few-shot setting, surpassing the state of the art (68%) by 18%.
- The news articles generated by the 175B-parameter GPT-3 model are hard to distinguish from real ones, according to human evaluations (with accuracy barely above the chance level at ~52%).
What does the AI community think?
- “The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.” – Sam Altman, CEO and co-founder of OpenAI.
- “I’m shocked how hard it is to generate text about Muslims from GPT-3 that has nothing to do with violence… or being killed…” – Abubakar Abid, CEO and founder of Gradio.
- “No. GPT-3 fundamentally does not understand the world that it talks about. Increasing corpus further will allow it to generate a more credible pastiche but not fix its fundamental lack of comprehension of the world. Demos of GPT-4 will still require human cherry picking.” – Gary Marcus, CEO and founder of Robust.ai.
- “Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.” – Geoffrey Hinton, Turing Award winner.
What are future research areas?
- Improving pre-training sample efficiency.
- Exploring how few-shot learning works.
- Distillation of large models down to a manageable size for real-world applications.
What are possible business applications?
- The model with 175B parameters is hard to apply to real business problems due to its impractical resource requirements, but if the researchers manage to distill this model down to a workable size, it could be applied to a wide range of language tasks, including question answering, dialog agents, and ad copy generation.
Where can you get implementation code?
- The code itself is not available, but some dataset statistics together with unconditional, unfiltered 2048-token samples from GPT-3 are released on GitHub.
7. Beyond Accuracy: Behavioral Testing of NLP models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Original Abstract
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Our Summary
The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.
What’s the core idea of this paper?
- Existing approaches to evaluation of NLP models have many significant shortcomings:
- The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
- The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness.
- To address this problem, the research team introduces CheckList, a new methodology for evaluating NLP models, inspired by the behavioral testing in software engineering:
- CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
- Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as prediction invariance or directional expectation tests under certain perturbations.
- Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
- The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily (a simplified illustration follows below).
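The released toolkit provides templating and test-type abstractions for generating such tests at scale. As a rough illustration of the underlying idea, written in plain Python rather than the CheckList API, an invariance test perturbs an input in a way that should not change the prediction and flags cases where it does; the helper names and the toy sentiment model below are hypothetical.

```python
def invariance_test(predict, sentences, perturb):
    """Return examples whose prediction changes under a label-preserving perturbation."""
    failures = []
    for sentence in sentences:
        original = predict(sentence)
        for variant in perturb(sentence):
            changed = predict(variant)
            if changed != original:
                failures.append((sentence, variant, original, changed))
    return failures

# Example capability: swapping one location name for another should not flip sentiment.
def swap_locations(sentence, locations=("Chicago", "Dallas", "Seattle")):
    return [sentence.replace("Chicago", loc) for loc in locations if loc != "Chicago"]

def toy_sentiment_model(sentence):       # stand-in for a real classifier
    return "negative" if "terrible" in sentence else "positive"

bugs = invariance_test(toy_sentiment_model,
                       ["The flight to Chicago was terrible."],
                       swap_locations)
print(f"{len(bugs)} failure(s) found")
```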
What’s the key achievement?
- Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, the behavioral testing highlights many areas for improvement.
- Applying CheckList to an extensively tested public-facing system for sentiment analysis showed that this methodology:
- helps to identify and test for capabilities not previously considered;
- results in more thorough and comprehensive testing for previously considered capabilities;
- helps to discover many more actionable bugs.
What does the AI community think?
- The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
- Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.
Where can you get implementation code?
- The code for testing NLP models with CheckList is available on GitHub.
8. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, by Nitika Mathur, Timothy Baldwin, Trevor Cohn
Original Abstract
Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
Our Summary
The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when comparing only a few best systems. Even negative correlations were exhibited in some instances. The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the correlation coefficient reflects different patterns of errors (type I vs. type II errors), and what magnitude of difference in the metric score corresponds to true improvements in translation quality as judged by humans. Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and other metrics, such as chrF, YiSi-1, and ESIM should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.
What’s the core idea of this paper?
- Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
- However, evaluating how well different automatic metrics concur with human evaluation is not a straightforward problem:
- For example, the recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (i.e., 0.9). However, if only a few best systems are considered, the correlation reduces markedly and can even be negative in some cases.
- The authors of this paper take a closer look at this problem and discover that:
- The identified problem with Pearson’s correlation is due to the small sample size and not specific to comparing strong MT systems.
- Outlier systems, whose quality is much higher or lower than the rest of the systems, have a disproportionate effect on the computed correlation and should be removed.
- The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
- Small BLEU differences of 1-2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
What’s the key achievement?
- Conducting a thorough analysis of automatic metrics’ performance vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems:
- Giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU and TER (see the example below).
- Moving away from using small changes in evaluation metrics as the sole basis to draw important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.
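Acting on the metric recommendation is straightforward in practice, since chrF ships alongside BLEU in standard tooling. The snippet below uses the sacreBLEU library with placeholder sentences and is not the authors’ analysis code; the point the paper makes is that a 1–2 point BLEU gap alone should not be read as a real quality difference without support from other metrics and from human judgments.

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["The cat sat on the mat.", "It is raining heavily today."]
references = [["The cat is sitting on the mat.", "It rains hard today."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}   chrF: {chrf.score:.1f}")
```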
What does the AI community think?
- The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.
Where can you get implementation code?
- The implementation code, data, and additional analysis will be released on GitHub.
9. Towards a Human-like Open-Domain Chatbot, by Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
Original Abstract
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.
Our Summary
In contrast to most modern conversational agents, which are highly specialized, the Google research team introduces a chatbot Meena that can chat about virtually anything. It’s built on a large neural network with 2.6B parameters trained on 341 GB of text. The researchers also propose a new human evaluation metric for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which can capture important attributes for human conversation. They demonstrate that this metric correlates highly with perplexity, an automatic metric that is readily available. Thus, the Meena chatbot, which is trained to minimize perplexity, can conduct conversations that are more sensible and specific compared to other chatbots. Particularly, the experiments demonstrate that Meena outperforms existing state-of-the-art chatbots by a large margin in terms of the SSA score (79% vs. 56%) and is closing the gap with human performance (86%).
What’s the core idea of this paper?
- Despite recent progress, open-domain chatbots still have significant weaknesses: their responses often do not make sense or are too vague or generic.
- To address these issues, the Google research team introduces Meena, a generative conversational model with 2.6B parameters trained on 40B words mined from public social media conversations:
- Meena is built on a seq2seq model with Evolved Transformer (ET) that includes 1 ET encoder block and 13 ET decoder blocks.
- The model is trained on multi-turn conversations with the input sequence including all turns of the context (up to 7) and the output sequence being the response.
- To measure the quality of open-domain chatbots, such as Meena, the researchers introduce a new human-evaluation metric, called Sensibleness and Specificity Average (SSA), that measures two fundamental aspects of a chatbot (see the sketch after this list):
- making sense,
- being specific.
- The research team discovered that the SSA metric shows a strong negative correlation with perplexity (R² = 0.93), a readily available automatic metric that Meena is trained to minimize: the lower the perplexity, the higher the SSA score.
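SSA itself is just the average of two human-labeled judgments. Below is a minimal sketch of how the score could be aggregated from per-response crowdworker labels; the data structure is an assumption, while the convention that a response judged not sensible also counts as not specific follows the paper.

```python
def ssa_score(labels):
    """Sensibleness and Specificity Average over human-labeled responses.

    labels: list of (sensible, specific) boolean pairs, one per chatbot response.
    """
    n = len(labels)
    sensibleness = sum(sensible for sensible, _ in labels) / n
    # A response that is not sensible cannot be counted as specific.
    specificity = sum(specific and sensible for sensible, specific in labels) / n
    return (sensibleness + specificity) / 2

print(ssa_score([(True, True), (True, False), (False, False), (True, True)]))  # 0.625
```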
What’s the key achievement?
- Proposing a simple human-evaluation metric for open-domain chatbots.
- Demonstrating that a large-scale low-perplexity model can be a good conversationalist:
- The best end-to-end trained Meena model outperforms existing state-of-the-art open-domain chatbots by a large margin, achieving an SSA score of 72% (vs. 56%).
- Furthermore, the full version of Meena, with a filtering mechanism and tuned decoding, further advances the SSA score to 79%, which is not far from the 86% SSA achieved by the average human.
What does the AI community think?
- “Google’s “Meena” chatbot was trained on a full TPUv3 pod (2048 TPU cores) for 30 full days – that’s more than $1,400,000 of compute time to train this chatbot model.” – Elliot Turner, CEO and founder of Hyperia.
- “So I was browsing the results for the new Google chatbot Meena, and they look pretty OK (if boring sometimes). However, every once in a while it enters ‘scary sociopath mode,’ which is, shall we say, sub-optimal” – Graham Neubig, Associate professor at Carnegie Mellon University.
What are future research areas?
- Lowering the perplexity through improvements in algorithms, architectures, data, and compute.
- Considering other aspects of conversations beyond sensibleness and specificity, such as, for example, personality and factuality.
- Tackling safety and bias in the models.
What are possible business applications?
- The authors suggest some interesting applications for open-domain chatbots such as Meena:
- further humanizing computer interactions;
- improving foreign language practice;
- making interactive movie and videogame characters relatable.
Where can you get implementation code?
- Considering the challenges related to safety and bias in the models, the authors haven’t released the Meena model yet. However, they are still evaluating the risks and benefits and may decide otherwise in the coming months.
10. Recipes for Building an Open-Domain Chatbot, by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston
Original Abstract
Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
Our Summary
The Facebook AI Research team shows that with appropriate training data and generation strategy, large-scale models can learn many important conversational skills, such as engagingness, knowledge, empathy, and persona consistency. Thus, to build their state-of-the-art conversational agent, called BlenderBot, they leveraged a model with 9.4B parameters, trained it on a novel task called Blended Skill Talk, and deployed beam search with carefully selected hyperparameters as a generation strategy. Human evaluations demonstrate that BlenderBot outperforms Meena in pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in terms of humanness.
What’s the core idea of this paper?
- The introduced recipe for building a state-of-the-art open-domain chatbot includes three key ingredients:
- Large scale. The largest model has 9.4 billion parameters and was trained on 1.5 billion training examples of extracted conversations.
- Blended skills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaging use of knowledge, and display of empathy.
- Beam search used for decoding. The researchers show that this generation strategy, deployed with carefully selected hyperparameters, gives strong results. In particular, they demonstrate that the length of the agent’s utterances is very important for chatbot performance (i.e., responses that are too short often feel dull, while responses that are too long make the chatbot appear to waffle and not listen). See the generation example after this list.
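The generation-strategy ingredient is easy to experiment with using the publicly released weights. The sketch below relies on the Hugging Face Transformers port of a distilled BlenderBot checkpoint rather than the original ParlAI code; the minimum-length constraint mirrors the paper’s finding that overly short beam-search replies feel dull.

```python
# pip install transformers
from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

checkpoint = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(checkpoint)
model = BlenderbotForConditionalGeneration.from_pretrained(checkpoint)

inputs = tokenizer("My dog just learned a new trick!", return_tensors="pt")

# Beam search with a minimum response length, so replies are neither
# one-word dull nor endless rambling.
reply_ids = model.generate(**inputs, num_beams=10, min_length=20, max_length=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```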
What’s the key achievement?
- The introduced chatbot outperforms the previous best-performing open-domain chatbot Meena. Thus, in pairwise match-ups, BlenderBot with 2.7B parameters wins:
- 75% of the time in terms of engagingness;
- 65% of the time in terms of humanness.
- In an A/B comparison between human-to-human and human-to-BlenderBot conversations, the latter were preferred 49% of the time as more engaging.
What are future research areas?
- Addressing limitations of the introduced conversational agent, including:
- a lack of in-depth knowledge if sufficiently interrogated;
- a tendency to use simpler language;
- a tendency to repeat oft-used phrases.
- Further exploring unlikelihood training and retrieve-and-refine mechanisms as potential avenues for fixing these issues.
Where can you get implementation code?
- Facebook AI open-sourced BlenderBot by releasing code to fine-tune the conversational agent, the model weights, and code to evaluate it.
If you like these research summaries, you might also be interested in the following articles:
- 2020’s Top AI & Machine Learning Research Papers
- Novel Computer Vision Research Papers From 2020
- AAAI 2021: Top Research Papers With Business Applications
- ICLR 2021: Key Research Papers