ACL is the leading conference in the field of natural language processing (NLP), covering a broad spectrum of research areas in computational linguistics. Due to COVID-19 risks, ACL 2020 took place entirely online, like other major academic conferences this year.
Even so, it remained the best place to learn about the latest research trends and cutting-edge papers in language modeling, conversational AI, machine translation, and other NLP topics.
Following the long-standing tradition, the best paper awards were announced during the last day of the main conference. In this article, we’ve summarized the key research ideas of the papers that received the Best Paper Award and Honorable Mentions at ACL 2020.
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
- Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
- Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
ACL 2020 Best Paper Awards
1. Beyond Accuracy: Behavioral Testing of NLP models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Original Abstract
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
Our Summary
The authors point out the shortcomings of existing approaches to evaluating the performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList, a new methodology for testing NLP models. The approach is inspired by principles of behavioral testing in software engineering. At its core, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.
What’s the core idea of this paper?
- Existing approaches to evaluation of NLP models have many significant shortcomings:
- The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
- Alternative approaches are usually designed to evaluate specific behaviors on individual tasks and thus lack comprehensiveness.
- To address this problem, the research team introduces CheckList, a new methodology for evaluating NLP models, inspired by behavioral testing in software engineering:
- CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
- Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types, such as minimum functionality tests, prediction invariance under certain perturbations, and directional expectation tests.
- Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
- The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily, as in the sketch below.
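Below is a minimal sketch of that workflow using the open-source checklist package: a template generates negation test cases, which are wrapped in a Minimum Functionality Test (MFT) and run against a sentiment model. The function `my_sentiment_model` is a hypothetical placeholder; any function that takes a list of strings and returns predictions with confidences can be plugged in.

```python
# Minimal CheckList sketch (pip install checklist). The sentiment model below
# is a hypothetical stand-in, not part of the paper's released code.
import numpy as np
from checklist.editor import Editor
from checklist.test_types import MFT

# 1. Generate test cases from a template; {pos_adj} is filled from a small lexicon.
editor = Editor()
ret = editor.template(
    "The movie was not {pos_adj}.",
    pos_adj=["good", "great", "enjoyable"],
    labels=0,  # expected label: negative sentiment
    save=True,
)

# 2. Wrap the cases in a Minimum Functionality Test for the Negation capability.
test = MFT(ret.data, labels=ret.labels,
           name="Simple negation of positive adjectives",
           capability="Negation",
           description="Negating a positive adjective should flip sentiment.")

# 3. Run the test with any prediction function that returns (preds, confs)
#    and inspect the failure rate.
def my_sentiment_model(sentences):
    # Hypothetical placeholder: always predicts "positive" (label 1),
    # so every negation test case fails.
    preds = np.ones(len(sentences), dtype=int)
    confs = np.full((len(sentences), 2), 0.5)
    return preds, confs

test.run(my_sentiment_model)
test.summary()  # prints the number of test cases, failures, and example failures
```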
What’s the key achievement?
- Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, behavioral testing highlights many areas for improvement.
- Applying CheckList to an extensively tested public-facing system for sentiment analysis showed that this methodology:
- helps to identify and test for capabilities not previously considered;
- results in more thorough and comprehensive testing for previously considered capabilities;
- helps to discover many more actionable bugs.
What does the AI community think?
- The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
- Such comprehensive testing helps identify many actionable bugs and is likely to lead to more robust NLP systems.
Where can you get implementation code?
- The code for testing NLP models with CheckList is available on GitHub.
2. Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks, by Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, Noah A. Smith
Original Abstract
Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.
Our Summary
The research team from the Allen Institute for Artificial Intelligence investigates whether the leading language models trained on massive heterogeneous corpora work universally or whether it is still useful to build separate models pretrained for specific domains. They address this question by considering four different domains and eight classification tasks, spanning low- and high-resource settings. Furthermore, they consider domain-adaptive pretraining as well as task-adaptive pretraining. The findings of the researchers suggest that both pretraining approaches consistently improve the performance of RoBERTa, one of the leading language models. They also show that manual curation of datasets for specific tasks further enhances model performance.
What’s the core idea of this paper?
- Today’s leading language models are trained on massive heterogeneous datasets and achieve strong performance across many tasks. At the same time, the benefits of domain-specific or task-specific pretraining are not well investigated.
- The research team addresses this question for one of the leading language models, RoBERTa:
- They consider four domains (biomedical research papers, computer science papers, news, and reviews) and eight classification tasks (two in each domain), in both high- and low-resource settings.
- The experiments cover continued pretraining on the domain, known as domain-adaptive pretraining (DAPT), and pretraining on a directly task-relevant corpus, known as task-adaptive pretraining (TAPT).
- Additionally, the researchers study whether manually curating the unlabeled task data (by task designers or annotators) brings further benefits for task-adaptive pretraining (see the continued-pretraining sketch after this list).
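For intuition, here is a minimal sketch of what domain- or task-adaptive pretraining looks like in practice: continued masked-language-model training of RoBERTa on an unlabeled corpus, using the Hugging Face transformers and datasets libraries. The corpus file name and hyperparameters are illustrative placeholders, not the paper's exact settings.

```python
# Continued MLM pretraining of RoBERTa on an unlabeled in-domain or task corpus.
# "domain_corpus.txt" is a hypothetical placeholder for your own text file.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Load the unlabeled corpus (one document or sentence per line) and tokenize it.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Standard masked-LM objective: mask 15% of tokens, as in RoBERTa pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="roberta-dapt",
                         per_device_train_batch_size=8,
                         num_train_epochs=1,
                         learning_rate=1e-4)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()

# The adapted model is then fine-tuned on the labeled end task as usual.
model.save_pretrained("roberta-dapt")
tokenizer.save_pretrained("roberta-dapt")
```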
What’s the key achievement?
- Demonstrating the importance of domain-specific and task-specific pretraining. The experiments show that:
- Domain-adaptive pretraining consistently improves performance on tasks from the target domain, in both low- and high-resource settings.
- Task-adaptive pretraining significantly boosts the performance of the language model, with or without domain-adaptive pretraining.
- Benefits from task-adaptive pretraining increase with additional unlabeled data that has been manually curated by task designers or annotators.
What does the AI community think?
- The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.
What are future research areas?
- The authors suggest the following directions for future research:
- better data selection for task-adaptive pretraining;
- efficient adaptation of large pretrained language models to distant domains;
- building reusable language models after adaptation.
What are possible business applications?
- The approaches studied in this paper can be applied to any pretrained language model to further improve its performance in specific domains and for specific NLP tasks.
Where can you get implementation code?
- The implementation code as well as pretrained models for multiple domains and tasks are publicly available on GitHub.
3. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, by Nitika Mathur, Timothy Baldwin, Trevor Cohn
Original Abstract
Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
Our Summary
The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when only the few best systems are compared. In some instances, the correlation was even negative. The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the same correlation coefficient can reflect different patterns of errors (type I vs. type II), and determining what magnitude of difference in the metric score corresponds to a true improvement in translation quality as judged by humans. Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and that other metrics, such as chrF, YiSi-1, and ESIM, should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.
What’s the core idea of this paper?
- Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
- However, evaluating how well different automatic metrics concur with human evaluation is not a straightforward problem:
- For example, recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (e.g., above 0.9). However, if only the few best systems are considered, the correlation drops markedly and can even be negative in some cases.
- The authors of this paper take a closer look at this problem and discover that:
- The identified problem with Pearson’s correlation is due to the small sample size and not specific to comparing strong MT systems.
- Outlier systems, whose quality is much higher or lower than that of the rest of the systems, have a disproportionate effect on the computed correlation and should be removed (see the sketch after this list).
- The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
- Small BLEU differences of 1-2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
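The sketch below illustrates the kind of analysis behind these findings: the Pearson correlation between system-level metric scores and human judgements, computed with and without an outlier system. The scores are synthetic, and the MAD-based outlier rule is one common robust choice rather than a claim about the paper's exact procedure.

```python
# Pearson correlation between an automatic metric and human judgements,
# with and without an outlier system. All scores here are synthetic.
import numpy as np
from scipy.stats import pearsonr

# System-level scores: human assessment scores and metric (e.g., BLEU) scores.
human = np.array([0.21, 0.18, 0.15, 0.14, 0.12, 0.11, -0.35])  # last system is an outlier
metric = np.array([34.1, 33.5, 33.0, 32.8, 32.5, 32.2, 20.4])

r_all, _ = pearsonr(human, metric)

# Flag outliers whose human score is far from the median (robust z-score via MAD).
med = np.median(human)
mad = np.median(np.abs(human - med))
robust_z = 0.6745 * (human - med) / mad
keep = np.abs(robust_z) < 2.5

r_no_outliers, _ = pearsonr(human[keep], metric[keep])

print(f"Pearson r, all systems:      {r_all:.2f}")
print(f"Pearson r, outliers removed: {r_no_outliers:.2f}")
```

Removing the single low-quality outlier typically changes the correlation noticeably, which is exactly why a single high Pearson coefficient over all systems can paint an overly optimistic picture of a metric.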
What’s the key achievement?
- Conducting a thorough analysis of automatic metrics vs. human judgments in machine translation, and providing key recommendations on evaluating MT systems:
- Giving preference to evaluation metrics such as chrF, YiSi-1, and ESIM over BLEU and TER (a sacrebleu sketch for computing chrF alongside BLEU follows this list).
- Moving away from using small changes in evaluation metrics as the sole basis to draw important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.
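As a small illustration of the first recommendation, the sketch below scores a system with both BLEU and chrF via the sacrebleu library (YiSi-1 and ESIM require separate tooling and are not shown). The hypothesis and reference sentences are toy examples.

```python
# Scoring one MT system with both BLEU and chrF using sacrebleu (pip install sacrebleu).
from sacrebleu.metrics import BLEU, CHRF

hypotheses = ["the cat sat on the mat", "there is a dog in the garden"]
references = [["the cat sat on the mat", "a dog is in the garden"]]  # one reference stream

bleu_score = BLEU().corpus_score(hypotheses, references)
chrf_score = CHRF().corpus_score(hypotheses, references)

print(bleu_score)  # e.g. "BLEU = ..."
print(chrf_score)  # e.g. "chrF2 = ..."
```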
What does the AI community think?
- The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing.
Where can you get implementation code?
- The implementation code, data, and additional analysis will be released on GitHub.
If you like these research summaries, you might be also interested in the following articles:
- Reformer, Longformer, and ELECTRA: Key Updates To Transformer Architecture In 2020
- 8 Leading Language Models For NLP In 2020
- We Summarized 14 NLP Research Breakthroughs You Can Apply To Your Business
- What Are Major NLP Achievements & Papers From 2019?
- What Every NLP Engineer Needs To Know About Pre-Trained Language Models
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.