Evaluation is a crucial part of the dialog system development process. Human judgment is considered a gold standard for the evaluation of dialog agents. However, this is also a very expensive and time-intensive approach. Thus, researchers mostly rely on automatic metrics when developing dialog systems.
In this article, we’ll provide a brief overview of the key evaluation methods for dialog systems. We’ll draw on the comprehensive Survey on Evaluation Methods for Dialog Systems prepared by Jan Deriu and his colleagues.
In their survey, Deriu et al. (2019) claim that a good evaluation method should have an automated and repeatable procedure with a high correlation to human judgments. It should also be able to differentiate between various dialog strategies as well as explain which features of the dialog system are important.
Let’s review the available metrics and see if they satisfy these requirements.
Open-domain dialogs
The major challenge with the evaluation of open-domain dialog systems comes from the one-to-many relationship between the user’s input and plausible responses. The available automatic metrics mostly do not solve this problem but they are still widely used when developing open-domain dialog models because human evaluations are prohibitively expensive to use at the model development stage. However, a good practice is to get human judgments for the evaluation of the final dialog model.
According to Deriu et al. (2019), human evaluation can be approached through:
- Lab experiments, where users are invited to the lab to interact with a dialog system and fill in a questionnaire afterwards. This approach was popular before crowdsourcing became widely available.
- In-field experiments, where feedback is collected from real users of a dialog system. This strategy allows user feedback to be gathered over a span of several months and was also used to judge the Alexa Prize.
- Crowdsourcing, where human evaluation is performed using crowdsourcing platforms such as Amazon Mechanical Turk (AMT). This is the most popular strategy for human evaluation in current research.
At the development stage, generative dialog models are usually evaluated using the following automatic metrics.
Perplexity measures how well a probabilistic model fits the data – the better the fit, the lower the perplexity. This metric is a strong indicator of whether the generated response is grammatical. Recently, the Google research team has also demonstrated that perplexity is strongly negatively correlated with the Sensibleness and Specificity Average (SSA) score, a human evaluation metric for open-domain chatbots.
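As a rough illustration, here is a minimal Python sketch of corpus-level perplexity computed from per-token log-probabilities. The `perplexity` helper and the toy probabilities are purely illustrative; in practice, the log-probabilities would come from the language model being evaluated.

```python
import math

def perplexity(token_logprobs):
    """Corpus-level perplexity from per-token natural-log probabilities.

    `token_logprobs` is a flat list of log p(token | context) values
    produced by whatever language model is being evaluated.
    """
    # Perplexity is the exponential of the average negative log-likelihood.
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Toy example: four tokens to which the model assigned these probabilities.
logprobs = [math.log(p) for p in (0.5, 0.25, 0.1, 0.4)]
print(perplexity(logprobs))  # ~3.76 -- lower is better
```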
The BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) metrics measure the word overlap between the generated responses and the reference ones. Many researchers argue that these metrics are not appropriate for the evaluation of open-domain dialog agents since there are many plausible responses to the same user’s input, while the number of reference responses in a test set is always limited. Indeed, Liu et al. (2016) showed in their paper that none of the word-overlap-based scores correlate with human judgments.
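To make the limitation concrete, here is a small sketch using NLTK’s `sentence_bleu` (assuming NLTK is installed); the example sentences are made up.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "there is a nice italian place near the city centre".split()
generated = "i know a nice italian restaurant in the centre".split()

# BLEU compares n-gram overlap between the generated response and the
# reference(s); smoothing avoids zero scores when higher-order n-grams
# never match, which is common for short dialog responses.
score = sentence_bleu(
    [reference],                     # list of reference token lists
    generated,
    weights=(0.5, 0.5),              # unigram and bigram precision only
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-2: {score:.3f}")
```

Even though the generated sentence is a perfectly sensible reply, it shares only a handful of n-grams with the single reference and therefore receives a mediocre score, which is exactly the one-to-many problem described above.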
ADEM (Lowe et al., 2017) is a recurrent neural network trained to predict appropriateness ratings by human evaluators. This trained metric has a significantly higher correlation with human judgments compared to word-overlap metrics but Sai et al. (2019) showed that it is not robust. Specifically, simple manipulations like reversing the generated response or replacing it with a dull dummy response often increase the predicted score.
Distinct-n measures the proportion of unique n-grams in a generated set of responses, making it a strong indicator of response diversity.
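A minimal sketch of how distinct-n can be computed over a set of generated responses; whitespace tokenization and the toy responses are simplifying assumptions.

```python
def distinct_n(responses, n):
    """Proportion of unique n-grams across a set of generated responses."""
    all_ngrams = []
    for response in responses:
        tokens = response.split()
        all_ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)

responses = ["i do not know", "i do not know", "i love jazz music"]
print(distinct_n(responses, 1))  # low values signal repetitive, generic replies
print(distinct_n(responses, 2))
```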
Topic-based metrics measure the system’s ability to have long and in-depth conversations on different topics. Topic depth is measured by the average length of a sub-conversation on a given topic, and topic breadth is measured by the total number of distinct topic keywords across all conversations.
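Below is a hedged sketch of these two quantities, assuming the conversations have already been segmented into (topic, number-of-turns) sub-conversations by some upstream topic classifier; the data format is hypothetical.

```python
from collections import defaultdict

def topic_depth(conversations):
    """Average length (in turns) of sub-conversations, per topic.

    `conversations` is a list of dialogs, each a list of
    (topic, turns_on_that_topic) sub-conversation records.
    """
    turns_by_topic = defaultdict(list)
    for dialog in conversations:
        for topic, turns in dialog:
            turns_by_topic[topic].append(turns)
    return {t: sum(v) / len(v) for t, v in turns_by_topic.items()}

def topic_breadth(conversations):
    """Total number of distinct topic keywords across all conversations."""
    return len({topic for dialog in conversations for topic, _ in dialog})

dialogs = [[("movies", 6), ("music", 2)], [("movies", 4), ("sports", 3)]]
print(topic_depth(dialogs))    # {'movies': 5.0, 'music': 2.0, 'sports': 3.0}
print(topic_breadth(dialogs))  # 3
```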
As mentioned before, the usual approach is to apply automated evaluation metrics during model development and then use human judgments to evaluate the final model. However, as you can see, the automated metrics have many shortcomings and, most importantly, they correlate poorly with human judgments. See et al. (2019) suggest four attributes to evaluate the performance of open-domain dialog agents: repetition, specificity, response-relatedness, and question-asking. Automatic metrics that consider these attributes might have a much higher correlation with human evaluations.
If you’re interested in learning about the most recently introduced evaluation metrics for dialog agents, check out these research papers featuring the latest approaches to evaluating open-domain dialog systems.
Task-oriented dialogs
Evaluation of task-oriented dialog systems relies on the structured nature of the interaction. Basically, we need to evaluate two aspects of the dialog: (1) task success, i.e. whether the task was completed successfully, and (2) dialog efficiency, i.e. whether it was completed with the minimum number of turns.
The task success rate measures how well the dialog system fulfills the requirements set by a user. For example, this metric reflects whether the system’s recommendation satisfies all the user’s requests (price range, location, cuisine) or whether the dialog agent provided all the information requested by the user (e.g., departure time for a specific bus route, departure station).
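As an illustration, here is a simplified sketch of a task success rate computation over annotated dialog logs; the log format (constraints, offered values, requested and provided information) is hypothetical and would depend on the dataset.

```python
def task_success_rate(dialogs):
    """Fraction of dialogs in which every user constraint and request
    was satisfied by the system (simplified, hypothetical log format)."""
    successes = 0
    for d in dialogs:
        constraints_met = all(
            d["offered"].get(slot) == value
            for slot, value in d["constraints"].items()
        )
        requests_answered = all(req in d["provided_info"] for req in d["requests"])
        if constraints_met and requests_answered:
            successes += 1
    return successes / len(dialogs)

logs = [
    {"constraints": {"cuisine": "italian", "price": "cheap"},
     "offered": {"cuisine": "italian", "price": "cheap"},
     "requests": ["address"], "provided_info": ["address", "phone"]},
    {"constraints": {"cuisine": "thai"},
     "offered": {"cuisine": "chinese"},
     "requests": ["phone"], "provided_info": ["phone"]},
]
print(task_success_rate(logs))  # 0.5
```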
One option is to measure the task success rate via a confusion matrix, which captures the errors made over a set of dialogs, together with the Kappa coefficient computed from this confusion matrix to correct for chance agreement.
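A minimal sketch of Cohen’s Kappa computed from such a confusion matrix (the toy matrix is made up):

```python
def kappa(confusion):
    """Cohen's Kappa from a square confusion matrix (list of lists),
    correcting the raw agreement rate for chance agreement."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    expected = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(len(confusion))
    )
    return (observed - expected) / (1 - expected)

# Rows: what the user asked for; columns: what the system actually
# returned, aggregated over a set of evaluation dialogs.
confusion = [[40, 5], [10, 45]]
print(kappa(confusion))  # ~0.70
```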
Dialog efficiency measures the length of the dialog in the number of turns or the elapsed time. Deriu et al. (2019) also mention finer-grained metrics such as the number of inappropriate utterances or the number of turns required by a sub-dialog to fill a single slot.
Pipeline systems for task-oriented dialog can also be evaluated by applying metrics to each subsystem separately. The comprehensive survey by Deriu et al. (2019) mentions the following metrics:
- Natural language understanding: Sentence Level Semantic Accuracy (SLSA), Slot Error Rate (SER), also called Concept Error Rate (CER), and F-measures (a toy SER computation is sketched after this list).
- Dialog state tracking: accuracy and the L2 metric.
- Natural language generation: F1 score to measure the correctness of the content and BLEU or ROUGE metric to measure the quality of the surface realization. Alternatively, human evaluations can be performed to measure the naturalness and quality of the generated responses.
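As referenced above, here is a toy sketch of the Slot Error Rate for a single utterance, assuming the NLU output and the reference annotation are both represented as flat slot-value dictionaries (a simplification of real semantic frames):

```python
def slot_error_rate(predicted, reference):
    """Slot Error Rate for one utterance: (substitutions + insertions +
    deletions) divided by the number of slots in the reference frame.
    `predicted` and `reference` are {slot: value} dicts."""
    substitutions = sum(
        1 for slot, value in reference.items()
        if slot in predicted and predicted[slot] != value
    )
    deletions = sum(1 for slot in reference if slot not in predicted)
    insertions = sum(1 for slot in predicted if slot not in reference)
    return (substitutions + insertions + deletions) / len(reference)

reference = {"cuisine": "italian", "price": "cheap", "area": "centre"}
predicted = {"cuisine": "italian", "price": "expensive"}  # 1 substitution + 1 deletion
print(slot_error_rate(predicted, reference))  # ~0.67
```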
Therefore, while the evaluation of task-oriented dialog systems can rely on the structured nature of the interaction, the evaluation of open-domain dialogs remains an open problem. In our research summaries, we’ve curated and featured the recent research papers that address this problem.