This research summary is part of our Conversational AI series, which covers the latest AI & machine learning approaches to conversational agents. In this last part of the series, we discuss the latest approaches to the evaluation of dialog agents.
Evaluating open-domain dialog agents remains a very challenging problem for the NLP research community. In contrast to task-oriented dialogs, chit-chat conversations have no explicit goal, and many different responses can be correct at each dialog turn.
Currently, the common approach is to apply automated evaluation metrics like BLEU, METEOR, or ROUGE during model development and then use human judgments to evaluate the final model. However, these automated metrics have many shortcomings and, most importantly, they correlate poorly with human judgments. At the same time, human evaluation is too expensive and time-consuming to be applied during model development.
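As a quick illustration of the problem (our own toy example, not drawn from any of the papers below), a perfectly sensible response can receive a near-zero BLEU score simply because it shares no n-grams with the single reference response:

```python
# Toy example: word-overlap metrics penalize valid responses that differ
# lexically from the single reference (the "one-to-many" problem of dialog).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "I am going hiking with some friends .".split()
candidate = "Not really , I might just stay home and read .".split()  # sensible reply, no overlap

smooth = SmoothingFunction().method1  # avoids zero scores from missing higher-order n-grams
print(sentence_bleu([reference], candidate, smoothing_function=smooth))  # close to 0
```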
To address this problem, researchers have introduced novel automated evaluation metrics that demonstrate a higher correlation with human judgments while requiring few or no human annotations. In this article, we summarize some of the most promising recently introduced approaches to the automated evaluation of open-domain conversational agents.
If these accessible AI research analyses & summaries are useful for you, you can subscribe to receive our regular industry updates below.
If you’d like to skip around, here are the papers we featured:
- Evaluating Coherence in Dialogue Systems using Entailment
- USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
- Learning an Unreferenced Metric for Online Dialogue Evaluation
- Designing Precise and Robust Dialogue Response Evaluators
Novel Automated Evaluation Metrics for Open-Domain Dialogs
1. Evaluating Coherence in Dialogue Systems using Entailment, by Nouha Dziri, Ehsan Kamalloo, Kory W. Mathewson, Osmar Zaiane
Original Abstract
Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers. Automatic metrics such as BLEU correlate weakly with human annotations, resulting in a significant bias across different models and datasets. Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time-consuming, and not scalable. Moreover, judges tend to evaluate a small number of dialogues, meaning that minor differences in evaluation configuration may lead to dissimilar results. In this paper, we present interpretable metrics for evaluating topic coherence by making use of distributed sentence representations. Furthermore, we introduce calculable approximations of human judgment based on conversational coherence by adopting state-of-the-art entailment techniques. Results show that our metrics can be used as a surrogate for human judgment, making it easy to evaluate dialogue systems on large-scale datasets and allowing an unbiased estimate for the quality of the responses.
Our Summary
The researchers from the University of Alberta suggest evaluating open-domain dialog systems by measuring the consistency of responses. They characterize the consistency of dialog systems as a natural language inference (NLI) problem: to convert automatic evaluation into an NLI task, the authors treat the generated response as a hypothesis and the conversation history as a premise, and focus on recognizing whether the response can be inferred from the conversation history. They train state-of-the-art inference models on synthesized inference data built from conversational corpora. A comparison of the proposed evaluation method against existing automated metrics demonstrates that inference models are effective at evaluating dialog coherence.
What’s the core idea of this paper?
- Open-domain dialog systems are hard to evaluate as there is no explicit goal for a conversation:
- Existing automatic metrics, like BLEU, usually correlate weakly with human evaluations.
- Obtaining human judgments is expensive, time-consuming, and not scalable.
- To address these limitations, the researchers introduce calculable approximations of human judgment based on conversational coherence:
- They frame automated dialog evaluation as an entailment problem, treating inference about entailment as a useful test bed for assessing the coherence of dialog systems (a minimal scoring sketch follows this list).
- To train the inference models, they build a synthesized dataset, InferConvAI, based on the Persona-Chat dialog corpus; it contains 1.1M pairs of premises (conversation histories) and hypotheses (generated responses), each labeled as entailment, contradiction, or neutral.
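To make the entailment framing concrete, here is a minimal scoring sketch. It uses an off-the-shelf MNLI model from Hugging Face Transformers as a stand-in for the authors' models trained on InferConvAI; the model name and the example dialog are our own illustrative assumptions, not the paper's setup.

```python
# A sketch of NLI-based coherence scoring: the conversation history is the
# premise, the generated response is the hypothesis, and the entailment
# probability serves as the coherence score.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-large-mnli"  # assumption: any MNLI-style model can stand in here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def coherence_score(history: str, response: str) -> float:
    inputs = tokenizer(history, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return probs[2].item()  # label order for this model: contradiction, neutral, entailment

history = "A: I just adopted a puppy. B: That's great! What breed is it?"
print(coherence_score(history, "It's a golden retriever, and it is very playful."))  # high
print(coherence_score(history, "I hate dogs and have never owned one."))             # low
```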
What’s the key achievement?
- Introducing a novel paradigm for evaluating the coherence of conversational agents and demonstrating that this approach:
- correlates reasonably with human judgments;
- is scalable as it doesn’t require human annotations.
What does the AI community think?
- The paper was accepted to NAACL-HLT 2019, one of the leading conferences in natural language processing.
What are future research areas?
- Exploring the potential approaches for measuring the engagingness of a conversation.
Where can you get implementation code?
- The implementation of the paper is released on GitHub.
2. USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation, by Shikib Mehri and Maxine Eskenazi
Original Abstract
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48 and system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
Our Summary
The researchers from Carnegie Mellon University address the lack of meaningful and reliable evaluation metrics for open-domain dialog systems. To this end, they introduce an UnSupervised and Reference-free (USR) evaluation metric that is composed of multiple sub-metrics that evaluate specific qualities of open-domain dialog (e.g., natural, interesting, uses knowledge). The metric can be adapted to different tasks and datasets by removing, re-weighting, or adding desired properties of a dialog. The experiments demonstrate that the USR metric strongly correlates with human judgment.
What’s the core idea of this paper?
- The lack of effective automatic evaluation metrics for open-domain dialogs impedes the research progress in this area:
- Researchers can only rely on human evaluation, which is usually obtained only for the final model because of its time- and cost-intensive nature.
- As a result, during development, models are optimized against automatic metrics that correlate poorly with human judgments.
- To address this problem, the researchers introduce an UnSupervised and Reference-free (USR) evaluation metric for open-domain dialog:
- It consists of several sub-metrics that are combined into one measure of overall quality.
- Instead of relying on a ground-truth reference response, unsupervised models are trained to measure multiple desired properties of a dialog (e.g., understandable, interesting, natural, maintains context, uses knowledge).
- The metric can be personalized for particular use cases by adding, removing, or re-weighting specific dialog properties (see the sketch after this list).
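The sketch below shows how sub-metric scores might be combined into an overall quality score. The sub-metric values and weights are made-up placeholders, and a simple weighted average stands in for the learned combination (in the paper, the sub-metrics come from unsupervised models and are combined by regressing against human overall-quality ratings).

```python
# Hypothetical per-response sub-metric scores, as produced by unsupervised models.
sub_scores = {
    "understandable":    0.95,
    "natural":           0.80,
    "maintains_context": 0.70,
    "interesting":       0.55,
    "uses_knowledge":    0.90,
}

# Hypothetical task-specific weights; adding, dropping, or re-weighting entries
# here is how the metric would be adapted to a new task or dataset.
weights = {
    "understandable":    1.0,
    "natural":           1.0,
    "maintains_context": 1.5,
    "interesting":       0.5,
    "uses_knowledge":    1.0,
}

def overall_quality(scores: dict, weights: dict) -> float:
    """Weighted average of sub-metric scores (a stand-in for the learned regression)."""
    total = sum(weights[k] for k in scores)
    return sum(weights[k] * scores[k] for k in scores) / total

print(overall_quality(sub_scores, weights))
```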
What’s the key achievement?
- Introducing a USR evaluation metric for open-domain dialog that:
- demonstrates high correlation with human judgment:
- on Topical-Chat, a turn-level Spearman correlation of 0.42 and a system-level Spearman correlation of 1.0;
- on Persona-Chat, a turn-level Spearman correlation of 0.48 and a system-level Spearman correlation of 1.0;
- produces interpretable measures for desirable properties of dialog.
- Releasing the dataset with human quality annotations for Amazon Topical-Chat and Persona-Chat to facilitate future benchmarking of dialog evaluation metrics.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- USR is aimed at facilitating the development of open-domain dialog agents:
- It can be used for model selection and hyperparameter tuning.
- At the same time, the authors of the paper state that USR should not be used to claim superior performance of one method over another.
Where can you get implementation code?
- The PyTorch implementation of the paper is released on GitHub.
- The dataset with thorough human-quality annotations for Amazon Topical-Chat and Persona-Chat is available here.
3. Learning an Unreferenced Metric for Online Dialogue Evaluation, by Koustuv Sinha, Prasanna Parthasarathi, Jasmine Wang, Ryan Lowe, William L. Hamilton, Joelle Pineau
Original Abstract
Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them do not generalize to unseen datasets and/or need a human-generated reference response during inference, making it infeasible for online evaluation. Here, we propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances, and leverages the temporal transitions that exist between them. We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.
Our Summary
The researchers address the problem of creating an evaluation metric for chit-chat dialog that would not require human-annotated reference responses but would still achieve a high correlation with human judgments. To this end, they introduce a completely unsupervised, unreferenced metric, MaUde (Metric for automatic Unreferenced dialog evaluation), that leverages a state-of-the-art pretrained language model (DistilBERT) to extract latent representations of utterances. Combined with a novel structure-aware text encoder and a contrastive training method, this pretrained model allows MaUde to achieve a high correlation with human judgments.
What’s the core idea of this paper?
- The paper proposes a new model, MaUde, for online unreferenced evaluation of open-domain dialog agents.
- It is inspired by the task of measuring alignment in natural language inference (NLI), but instead of estimating entailment or contradiction, as in typical NLI tasks, the researchers aim to measure the quality of dialog responses.
- The introduced model has two key components:
- First, the researchers suggest training the model to differentiate between a correct response and a negative response using Noise Contrastive Estimation (NCE).
- Then, they introduce a specialized text encoder for MaUde that builds on a BERT-based encoder (DistilBERT) and additionally models dialog transitions with a recurrent neural network (a minimal sketch follows this list).
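Below is a rough sketch, under our own assumptions, of what such an unreferenced scorer can look like: a pretrained encoder (DistilBERT here) embeds each utterance, a recurrent network summarizes the dialog history, and a bilinear layer scores a candidate response against that summary. This is not the authors' code, and the contrastive (NCE) training loop that distinguishes correct from negative responses is omitted.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class UnreferencedScorer(nn.Module):
    def __init__(self, encoder_name="distilbert-base-uncased", hidden=768):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.history_rnn = nn.GRU(hidden, hidden, batch_first=True)  # models dialog transitions
        self.scorer = nn.Bilinear(hidden, hidden, 1)                 # scores (history, response) pairs

    def _embed(self, texts):
        batch = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        return self.encoder(**batch).last_hidden_state[:, 0, :]  # first-token embedding per utterance

    def forward(self, history_utterances, response):
        utterances = self._embed(history_utterances).unsqueeze(0)  # (1, turns, hidden)
        _, history_state = self.history_rnn(utterances)            # final state summarizes the history
        response_emb = self._embed([response])                     # (1, hidden)
        return torch.sigmoid(self.scorer(history_state.squeeze(0), response_emb))  # score in (0, 1)

scorer = UnreferencedScorer()
history = ["Hi, how was your weekend?", "Great, I went hiking in the mountains."]
with torch.no_grad():
    print(scorer(history, "That sounds fun! Which trail did you take?"))
```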
What’s the key achievement?
- The experiments demonstrate that MaUde outperforms the baselines in terms of correlation with human judgments, especially with respect to dialog engagingness and interestingness.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
What are future research areas?
- Leveraging MaUde to optimize and train better dialog generation models.
What are possible business applications?
- The introduced approach can be applied to evaluating online dialog conversations.
Where can you get implementation code?
- The code for reproducing the experiments is available on GitHub.
4. Designing Precise and Robust Dialogue Response Evaluators, by Tianyu Zhao, Divesh Lala, Tatsuya Kawahara
Original Abstract
Automatic dialogue response evaluators have been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data at https://github.com/ZHAOTING/dialog-processing.
Our Summary
The research team from Kyoto University addresses the problem of building an automated evaluator that strongly correlates with human judgments and generalizes to unseen dialogs. In particular, they explore three methods for improving the performance of response evaluators: (1) using reference-free metrics; (2) combining unsupervised learning with fine-tuning on a small amount of annotated data; and (3) leveraging a pretrained language model (RoBERTa) for better text representations. The experiments demonstrate that these three methods combined allow the proposed evaluator to outperform other evaluators in terms of correlation with human judgments and robustness. The introduced evaluator also performs well in cross-domain and low-resource settings.
What’s the core idea of this paper?
- The authors suggest three methods for improving the performance of automatic dialog evaluators:
- Reference-free evaluation for improved robustness and diversity of responses:
- in contrast to reference-free evaluators, referenced evaluators have been shown to produce scores with very low standard deviation and to perform poorly when ground-truth responses are removed from the test data;
- Semi-supervised learning, i.e. applying unsupervised training first and then fine-tuning an evaluator with a relatively small amount of annotated data.
- RoBERTa-based text encoder for better text representations (a minimal sketch of such an evaluator follows this list).
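As a rough illustration (not the authors' implementation), the sketch below combines these ideas: RoBERTa encodes the (context, response) pair without any reference response, a linear head predicts a quality score, and a hypothetical fine-tuning loop uses a handful of annotated examples, corresponding to the supervised stage of the semi-supervised recipe; the unsupervised pretraining stage is omitted.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ReferenceFreeEvaluator(nn.Module):
    """Scores a response given only the dialog context (no ground-truth reference)."""
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, context, response):
        batch = self.tokenizer(context, response, return_tensors="pt", truncation=True)
        pooled = self.encoder(**batch).last_hidden_state[:, 0, :]  # first-token representation
        return self.head(pooled).squeeze(-1)                       # scalar quality score

# Hypothetical fine-tuning on a tiny annotated set (the paper reports that as few as
# ~100 annotated samples can suffice); the examples and scores here are invented.
annotated = [
    ("A: Any plans tonight? B: Not yet, why?", "Want to grab dinner together?", 0.9),
    ("A: Any plans tonight? B: Not yet, why?", "The capital of France is Paris.", 0.1),
]
model = ReferenceFreeEvaluator()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()
for context, response, human_score in annotated:
    prediction = model(context, response)
    loss = loss_fn(prediction, torch.tensor([human_score]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```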
What’s the key achievement?
- Comparison with several strong baselines demonstrates that the proposed evaluator:
- outperforms the alternative approaches by a large margin (Spearman correlation with human judgments of 0.66 vs. 0.41);
- produces scores whose diversity (standard deviation) is close to that of human judgments;
- generalizes well to a new corpus;
- can be trained efficiently with only 100 annotated samples, implying its applicability in low-resource settings.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
Where can you get implementation code?
- The PyTorch implementation code and data are open-sourced on GitHub.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.