This research summary is part of our Conversational AI series, which covers the latest AI and machine learning approaches to building dialog systems.
In this post, we review the recently introduced datasets for training, validating, and evaluating dialog systems.
Recent progress in language modeling and natural language generation has resulted in more sophisticated chatbots, both chit-chat and goal-oriented. Still, today's dialog agents remain very limited in their ability to hold human-like conversations. The NLP research community is working on novel architectures and approaches to improve the performance of conversational agents.
Many of the limitations of today's chatbots, however, stem from the lack of properly designed and collected dialog corpora. There is a noticeable gap between existing dialog datasets and real-life human conversations: the datasets cover a limited number of domains, focus on only one or a few skills (e.g., empathy, persona consistency), and impose many unrealistic constraints.
Researchers are continuously working on designing, collecting, and annotating new dialog corpora that should help with the existing challenges. In this article, we summarize the research papers that introduce some of the most useful novel datasets for training and evaluating open-domain and task-oriented dialog systems.
If you’d like to skip around, here are the papers we featured:
- MuTual: A Dataset for Multi-Turn Dialogue Reasoning
- The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents
- Can You Put it All Together: Evaluating Conversational Agents’ Ability to Blend Skills
- Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data
- Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset
- CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset
Open-Domain and Task-Oriented Dialog Datasets
1. MuTual: A Dataset for Multi-Turn Dialogue Reasoning, by Leyang Cui, Yu Wu, Shujie Liu, Yue Zhang, Ming Zhou
Original Abstract
Non-task oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques. Given a context, current systems are able to yield a relevant and fluent response, but sometimes make logical mistakes because of weak reasoning capabilities. To facilitate the conversation reasoning research, we introduce MuTual, a novel dataset for Multi-Turn dialogue Reasoning, consisting of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams. Compared to previous benchmarks for non-task oriented dialogue systems, MuTual is much more challenging since it requires a model that can handle various reasoning problems. Empirical results show that state-of-the-art methods only reach 71%, which is far behind the human performance of 94%, indicating that there is ample room for improving reasoning ability. MuTual is available at https://github.com/Nealcly/MuTual.
Our Summary
The researchers from Microsoft Research Asia draw attention to the fact that real-world chatbots often generate logically incorrect responses, implying that current dialog systems may lack reasoning skills. To address this problem, they introduce an open-domain Multi-Turn dialogue reasoning (MuTual) dataset based on Chinese high school English listening comprehension tests, where students need to select the best answer out of several options, given a multi-turn dialog and a question. The dataset consists of 8,860 questions with four response candidates each; all candidates are relevant to the context, but only one is logically correct. Evaluation of state-of-the-art models on MuTual demonstrates that the best method achieves a recall at position 1 (R@1) of only 71%, significantly below human performance (94%), implying that even the leading models lack reasoning abilities.
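For context on the evaluation setup, here is a minimal sketch of how recall at position 1 (R@1) is computed for this kind of multiple-choice response selection. The scoring function is a toy placeholder, not the models evaluated in the paper (which fine-tune BERT- and RoBERTa-style encoders to score context-candidate pairs).

```python
# Minimal sketch of R@1 evaluation for multiple-choice response selection,
# the metric reported on MuTual. The scorer below is a toy placeholder.

def recall_at_1(examples, score_fn):
    """examples: iterable of (context, candidates, correct_index) triples;
    score_fn: callable mapping (context, candidate) -> float."""
    hits = 0
    total = 0
    for context, candidates, correct_idx in examples:
        scores = [score_fn(context, cand) for cand in candidates]
        # The model "retrieves" the highest-scoring candidate.
        predicted = max(range(len(candidates)), key=lambda i: scores[i])
        hits += int(predicted == correct_idx)
        total += 1
    return hits / total

# Toy usage with a deliberately naive scorer (shorter answers score higher):
examples = [
    ("M: Did you pass the exam? W: I got every question right.",
     ["The woman failed the exam.",
      "The woman passed the exam.",
      "The woman did not take the exam.",
      "The woman lost her exam paper."],
     1),
]
print(recall_at_1(examples, lambda ctx, cand: -len(cand)))  # 0.0 here
```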
What’s the core idea of this paper?
- Despite the noticeable progress in dialog agents, real-world chatbots often generate responses that violate commonsense knowledge and are logically incorrect. This is partly because existing benchmarks contain tasks that can be solved by surface-level linguistic matching alone, without any reasoning.
- To address this problem, the authors introduce an open-domain Multi-Turn dialogue reasoning (MuTual) dataset:
- It contains 8,860 challenging questions where the model needs to select the only logically correct answer from four context-relevant candidates, given a multi-turn dialog.
- All the questions involve reasoning and are designed by expert linguists and high-quality annotators.
What’s the key achievement?
- Introducing the first human-labeled reasoning-based dataset for multi-turn dialog.
- Demonstrating that even state-of-the-art models like BERT and RoBERTa are far behind human performance on the MuTual dataset, indicating that the current language models don’t have sufficiently strong reasoning capabilities.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- The MuTual dataset can facilitate the design of open-domain chatbots with stronger reasoning skills and thus fewer illogical responses.
Where can you get a dataset?
- The dataset has been released on GitHub.
2. The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents, by Kurt Shuster, Da Ju, Stephen Roller, Emily Dinan, Y-Lan Boureau, Jason Weston
Original Abstract
We introduce dodecaDialogue: a set of 12 tasks that measures if a conversational agent can communicate engagingly with personality and empathy, ask questions, answer questions by utilizing knowledge resources, discuss topics and situations, and perceive and converse about images. By multi-tasking on such a broad large-scale set of data, we hope to both move towards and measure progress in producing a single unified agent that can perceive, reason and converse with humans in an open-domain setting. We show that such multi-tasking improves over a BERT pre-trained baseline, largely due to multi-tasking with very large dialogue datasets in a similar domain, and that the multi-tasking in general provides gains to both text and image-based tasks using several metrics in both the fine-tune and task transfer settings. We obtain state-of-the-art results on many of the tasks, providing a strong baseline for this challenge.
Our Summary
To support human-like conversations, open-domain chatbots should demonstrate a number of different properties, such as being knowledgeable, personable, and engaging, being capable of answering and asking questions, and being able to ground the dialog in external sources and images. The Facebook AI Research team claims that no single existing task can train a dialog agent and measure its ability on all of these properties. Therefore, they introduce dodecaDialogue, a new challenging task that consists of 12 subtasks. The researchers also propose a model that can be trained on all of these subtasks. The experiments demonstrate that, after pre-training on the largest of the subtasks and then multi-tasking on all of them, this model achieves state-of-the-art results on all 10 subtasks that have previous results for comparison.
What’s the core idea of this paper?
- A good open-domain dialog agent should demonstrate many different qualities, including knowledgeability, personability, the ability to ground responses in external knowledge and images, and the ability to ask and answer questions.
- To build such capable dialog agents, the authors suggest assembling 12 disparate large-scale datasets in a single challenge called dodecaDialogue. The subtasks include:
- ConvAI2
- DailyDialog
- Wizard of Wikipedia
- EmpatheticDialogues
- Cornell Movie
- LIGHT
- ELI5
- Ubuntu
- Twitter
- pushshift.io Reddit
- Image Chat
- Image Grounded Conversations (IGC)
- The paper also introduces a model capable of training and multi-tasking on all these sources (a minimal sketch of the multi-tasking strategy follows this list):
- It is based on the Transformer architecture.
- It takes an image, external textual information, and the dialog history as input, and outputs a response for the given dialog turn.
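To make the multi-tasking setup concrete, here is a minimal, self-contained sketch of the general strategy of sampling a subtask per training step. The toy loaders, weights, and training step are illustrative stand-ins, not the authors' actual implementation, which lives in the ParlAI framework.

```python
import itertools
import random

def toy_batches(task_name):
    # Stand-in for a real data loader: yields dummy (input, target) pairs.
    for i in itertools.count():
        yield (f"{task_name} input {i}", f"{task_name} target {i}")

tasks = ["convai2", "wizard_of_wikipedia", "empathetic_dialogues", "image_chat"]
loaders = {name: toy_batches(name) for name in tasks}

# Sampling tasks in proportion to dataset size is one common weighting
# scheme; these weights are arbitrary, for illustration only.
weights = [5.0, 2.0, 1.0, 2.0]

def train_step(batch):
    # Placeholder for a forward/backward pass on the Transformer model.
    pass

for step in range(1000):
    task = random.choices(tasks, weights=weights)[0]
    train_step(next(loaders[task]))
```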
What’s the key achievement?
- Setting a strong baseline for the dodecaDialogue challenge by introducing a multi-task model that outperforms the existing state-of-the-art approaches on 10 out of 12 subtasks, with the other two subtasks having no previous results for reference.
- Demonstrating that training dialog agents on datasets that are closely linked to the desired agent’s goals is a strong alternative to large-scale pre-training on general text corpora.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading research conference in natural language processing.
What are future research areas?
- Exploring architectures that can handle multi-tasking even better.
- Adding more tasks to the suggested ones (e.g., having longer conversations involving memory, or mixing open-domain conversations with goal-oriented ones).
What are possible business applications?
- Building open-domain conversational agents capable of multiple skills.
Where can you get implementation code?
- The best models that have been introduced and evaluated in the paper are released through the ParlAI framework.
3. Can You Put it All Together: Evaluating Conversational Agents’ Ability to Blend Skills, by Eric Michael Smith, Mary Williamson, Kurt Shuster, Jason Weston, Y-Lan Boureau
Original Abstract
Being engaging, knowledgeable, and empathetic are all desirable general qualities in a conversational agent. Previous work has introduced tasks and datasets that aim to help agents to learn those qualities in isolation and gauge how well they can express them. But rather than being specialized in one single quality, a good open-domain conversational agent should be able to seamlessly blend them all into one cohesive conversational flow. In this work, we investigate several ways to combine models trained towards isolated capabilities, ranging from simple model aggregation schemes that require minimal additional training, to various forms of multi-task training that encompass several skills at all training stages. We further propose a new dataset, BlendedSkillTalk, to analyze how these capabilities would mesh together in a natural conversation, and compare the performance of different architectures and training schemes. Our experiments show that multi-tasking over several tasks that focus on particular capabilities results in better blended conversation performance compared to models trained on a single skill, and that both unified or two-stage approaches perform well if they are constructed to avoid unwanted bias in skill selection or are fine-tuned on our new task.
Our Summary
This is another research paper from the Facebook AI Research team investigating the problem of building an open-domain chatbot with multiple skills. In particular, the authors examine how to combine such traits as (1) the ability to provide and request personal details, (2) knowledgeability, and (3) empathy. First, they try training a model on these three skills separately, using the ConvAI2, Wizard of Wikipedia, and EmpatheticDialogues datasets. However, a model trained this way may still struggle to blend the different skills seamlessly over the course of a single conversation. Therefore, the researchers introduce BlendedSkillTalk, a novel dataset of about 5K dialogs, where crowdsourced workers were instructed to be knowledgeable, empathetic, and give personal details whenever appropriate. The experiments demonstrate that single-skill tasks can be effective in training a single dialog agent that blends all the skills, if care is taken to avoid unwanted biases in skill selection, if the agent is fine-tuned on the blended dataset, or both.
What’s the core idea of this paper?
- Exploring the most effective strategies for building an open-domain conversational agent that can display multiple qualities over the course of a single conversation:
- training an agent on multiple single-skill datasets (ConvAI2, Wizard of Wikipedia, and EmpatheticDialogues);
- introducing approaches to avoid unwanted biases when selecting the skill;
- fine-tuning a model on blended data.
- Introducing a new English-language dataset, BlendedSkillTalk, which combines several skills into a single conversation:
- The dataset contains 4,819 dialogs in the training set, 1,009 dialogs in the validation set, and 980 dialogs in the test set.
- Conversations in the training set contain 11.2 utterances on average.
What’s the key achievement?
- Demonstrating that the model pre-trained on the pushshift.io Reddit dataset, multi-tasked on three single-skill datasets during fine-tuning, and then fine-tuned on the introduced blended dataset outperforms the models trained on a single skill.
- Introducing approaches to mitigate biases when blending and selecting multiple skills.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading research conference in natural language processing.
What are future research areas?
- Extending the findings of this paper to other skills (e.g., humor, eloquence, image commenting).
What are possible business applications?
- Building an open-domain dialog agent capable of multiple skills such as the ability to provide and request personal details, knowledgeability, and empathy.
Where can you get a dataset?
- The dataset is available through the ParlAI framework; a minimal loading sketch is shown below.
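For readers who want to inspect the data, here is a minimal sketch using ParlAI's Python script interface. It assumes `pip install parlai`; the `DisplayData` interface below matches recent ParlAI releases, but it is worth verifying against your installed version.

```python
# Minimal sketch: download BlendedSkillTalk via ParlAI and print samples.
from parlai.scripts.display_data import DisplayData

if __name__ == "__main__":
    # Downloads the dataset on first run, then prints a few sample dialogs.
    DisplayData.main(task="blended_skill_talk", num_examples=5)
```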
4. Multi-Domain Goal-Oriented Dialogues (MultiDoGO): Strategies toward Curating and Annotating Large Scale Dialogue Data, by Denis Peskov, Nancy Clarke, Jason Krone, Brigi Fodor, Yi Zhang, Adel Youssef, Mona Diab
Original Abstract
The need for high-quality, large-scale, goal-oriented dialogue datasets continues to grow as virtual assistants become increasingly wide-spread. However, publicly available datasets useful for this area are limited either in their size, linguistic diversity, domain coverage, or annotation granularity. In this paper, we present strategies toward curating and annotating large scale goal-oriented dialogue data. We introduce the MultiDoGO dataset to overcome these limitations. With a total of over 81K dialogues harvested across six domains, MultiDoGO is over 8 times the size of MultiWOZ, the other largest comparable dialogue dataset currently available to the public. Over 54K of these harvested conversations are annotated for intent classes and slot labels. We adopt a Wizard-of-Oz approach wherein a crowd-sourced worker (the “customer”) is paired with a trained annotator (the “agent”). The data curation process was controlled via biases to ensure a diversity in dialogue flows following variable dialogue policies. We provide distinct class label tags for agents vs. customer utterances, along with applicable slot labels. We also compare and contrast our strategies on annotation granularity, i.e. turn vs. sentence level. Furthermore, we compare and contrast annotations curated by leveraging professional annotators vs the crowd. We believe our strategies for eliciting and annotating such a dialogue dataset scales across modalities and domains and potentially languages in the future. To demonstrate the efficacy of our devised strategies we establish neural baselines for classification on the agent and customer utterances as well as slot labeling for each domain.
Our Summary
The Amazon AWS AI researchers address common issues with task-oriented dialog datasets, such as limited size, low linguistic diversity, narrow domain coverage, and coarse annotation granularity, and introduce the MultiDoGO dataset to overcome these limitations. The dataset comprises over 86K conversations, of which 54,818 are annotated at the turn level. It was collected using the Wizard-of-Oz approach, with crowdsourced workers acting as customers and trained annotators acting as agents. The researchers also suggest distinguishing between agents' and customers' utterances: the agents' utterances are annotated with generic class labels common across all domains, while customers' utterances are labeled with intent classes and appropriate slot labels. The introduced dataset is larger, more controlled, and covers more domains than other comparable datasets.
What’s the core idea of this paper?
- The authors argue that a task-oriented dialog dataset should have different annotations for agents' vs. customers' utterances, since agents are incentivized to communicate within a set procedure in a patient and professional tone, while customers have no such incentive.
- To this end, they introduce their strategies and approaches for curating and annotating a large-scale multi-domain task-oriented dialog dataset:
- The dataset is gathered with the Wizard-of-Oz technique, where crowd-sourced workers act as customers and trained annotators act as agents.
- To ensure diversity in dialog flows, the participants are explicitly guided to engage in conversations with specific biases (e.g., intent change, multiple slot values, slot-overfilling).
- The agents' utterances are annotated with generic class labels common across all domains, while customers' utterances are labeled with intent classes and appropriate slot labels (an illustrative sketch follows this list).
- The paper also presents a dataset gathered using the introduced approaches and strategies, the MultiDoGO dataset, which:
- comprises 86K conversations, of which 54,818 are annotated at the turn level and 15,000 at both the turn and sentence levels;
- covers six domains, including airline, fast food, finance, insurance, media, and software support.
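To make the two-sided annotation scheme concrete, here is an illustrative sketch of what annotated turns might look like. The field names, intent classes, and slot labels are hypothetical stand-ins, not the released schema; consult the dataset on GitHub for the exact format.

```python
# Illustrative sketch of MultiDoGO-style turn-level annotation.
# All field names and label strings below are hypothetical stand-ins.
customer_turn = {
    "speaker": "customer",
    "utterance": "I'd like to book a flight from Boston to Denver on Friday.",
    # Customer turns carry domain-specific intent classes and slot labels.
    "intent": "BookFlight",
    "slots": {
        "origin_city": "Boston",
        "destination_city": "Denver",
        "departure_date": "Friday",
    },
}
agent_turn = {
    "speaker": "agent",
    "utterance": "Sure, what name is the reservation under?",
    # Agent turns carry generic class labels shared across all six domains.
    "label": "RequestInformation",
}
```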
What’s the key achievement?
- Introducing a task-oriented dialog dataset that is substantially larger (over 8 times the size of MultiWOZ), more diverse, and more controlled than existing alternatives.
What does the AI community think?
- The paper was accepted for oral presentation at EMNLP 2019, one of the leading research conferences in natural language processing.
What are future research areas?
- Scaling the proposed data collection and annotation methodology to other languages.
What are possible business applications?
- The dataset can be used to train task-oriented conversational agents for:
- booking airline flights or changing flight details;
- ordering food;
- opening bank accounts or checking customers’ balances;
- requesting the fulfillment of insurance policies;
- ordering a service or paying bills related to telecommunications;
- inquiring about software services, products, promotions, and bills.
Where can you get a dataset?
- The MultiDoGO dataset is available on GitHub.
5. Towards Scalable Multi-domain Conversational Agents: The Schema-Guided Dialogue Dataset, by Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, Pranav Khaitan
Original Abstract
Virtual assistants such as Google Assistant, Alexa and Siri provide a conversational interface to a large number of services and APIs spanning multiple domains. Such systems need to support an ever-increasing number of services with possibly overlapping functionality. Furthermore, some of these services have little to no training data available. Existing public datasets for task-oriented dialogue do not sufficiently capture these challenges since they cover few domains and assume a single static ontology per domain. In this work, we introduce the Schema-Guided Dialogue (SGD) dataset, containing over 16k multi-domain conversations spanning 16 domains. Our dataset exceeds the existing task-oriented dialogue corpora in scale, while also highlighting the challenges associated with building large-scale virtual assistants. It provides a challenging testbed for a number of tasks including language understanding, slot filling, dialogue state tracking and response generation. Along the same lines, we present a schema-guided paradigm for task-oriented dialogue, in which predictions are made over a dynamic set of intents and slots, provided as input, using their natural language descriptions. This allows a single dialogue system to easily support a large number of services and facilitates simple integration of new services without requiring additional training data. Building upon the proposed paradigm, we release a model for dialogue state tracking capable of zero-shot generalization to new APIs, while remaining competitive in the regular setting.
Our Summary
Existing task-oriented dialog datasets do not sufficiently capture the challenges associated with building large-scale virtual assistants like Google Assistant, Alexa, or Siri, which need to support a constantly increasing number of services and APIs across multiple domains, often with overlapping functionality. To address these challenges, the Google Research team introduces the Schema-Guided Dialogue (SGD) dataset, with over 16K dialogs in the training set spanning 26 services that belong to 16 different domains. To test the models' ability to generalize in zero-shot settings, the evaluation sets contain unseen services and domains. The authors also introduce a schema-guided paradigm for task-oriented dialog that enables effective knowledge sharing among all services by relating semantically similar concepts across APIs. A multi-domain dialog state tracking model based on a pre-trained encoder, built under this paradigm, successfully generalizes to unseen services and is robust to API changes.
What’s the core idea of this paper?
- Addressing the challenges associated with building large-scale virtual assistants by presenting a Schema-Guided Dialogue (SGD) dataset:
- The training set includes 16K conversations spanning 26 services across 16 different domains.
- The evaluation set covers 4 additional domains and contains many services, and consequently slots, that are not present in the training set, to evaluate model performance on unseen services and domains.
- Building this dataset in the following steps:
- creating synthetic implementations of 45 services or APIs over 20 domains;
- using the simulator framework to interact with these services and generate dialog outlines;
- applying a crowd-sourcing procedure to paraphrase these outlines to natural language, while preserving all annotations obtained from the simulator.
- Proposing a schema-guided paradigm for task-oriented dialog, with the aim of building a single unified dialog model for all services and APIs:
- A service’s schema is leveraged by the model to make predictions over the dynamic set of intents and slots present in the schema.
- Knowledge can be effectively shared among all services by relating semantically similar concepts across APIs (see the schema sketch below).
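Below is an illustrative sketch of what a service schema looks like under this paradigm. The structure loosely mirrors the released SGD schema files but is simplified, so treat the exact keys as assumptions.

```python
# Illustrative sketch of a schema in the schema-guided paradigm: intents and
# slots carry natural language descriptions that the model embeds, so unseen
# services can be supported without retraining. Keys here are simplified.
restaurant_schema = {
    "service_name": "Restaurants_1",
    "description": "A service for discovering and reserving restaurants",
    "slots": [
        {"name": "city",
         "description": "City where the restaurant is located"},
        {"name": "party_size",
         "description": "Number of people for the reservation"},
    ],
    "intents": [
        {"name": "ReserveRestaurant",
         "description": "Reserve a table at a restaurant",
         "required_slots": ["city", "party_size"]},
    ],
}

# Conceptually, the model scores slots and intents by comparing an encoding
# of the user utterance against encodings of these descriptions, e.g.
#   score(slot) = f(encode(utterance), encode(slot["description"])),
# which is what enables zero-shot transfer to unseen services.
```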
What’s the key achievement?
- Introducing a large-scale multi-domain task-oriented dialog dataset for training and testing virtual assistants for intent prediction, slot filling, state tracking, and language generation, among other tasks.
- Proposing a multi-domain dialog state tracking model, based on a pre-trained encoder and built under the schema-guided paradigm, and demonstrating that this model:
- generalizes to unseen services;
- achieves competitive results on the WOZ 2.0 and MultiWOZ 2.1 datasets.
What does the AI community think?
- The paper was accepted for oral presentation at AAAI 2020, one of the leading conferences in artificial intelligence.
What are future research areas?
- Carrying out a comprehensive side-by-side comparison with alternative datasets, considering that the SGD dataset covers many new domains at a large scale.
Where can you get a dataset?
- The dataset has been released on GitHub.
6. CrossWOZ: A Large-Scale Chinese Cross-Domain Task-Oriented Dialogue Dataset, by Qi Zhu, Kaili Huang, Zheng Zhang, Xiaoyan Zhu, Minlie Huang
Original Abstract
To advance multi-domain (cross-domain) dialogue modeling as well as alleviate the shortage of Chinese task-oriented datasets, we propose CrossWOZ, the first large-scale Chinese Cross-Domain Wizard-of-Oz task-oriented dataset. It contains 6K dialogue sessions and 102K utterances for 5 domains, including hotel, restaurant, attraction, metro, and taxi. Moreover, the corpus contains rich annotation of dialogue states and dialogue acts at both user and system sides. About 60% of the dialogues have cross-domain user goals that favor inter-domain dependency and encourage natural transition across domains in conversation. We also provide a user simulator and several benchmark models for pipelined task-oriented dialogue systems, which will facilitate researchers to compare and evaluate their models on this corpus. The large size and rich annotation of CrossWOZ make it suitable to investigate a variety of tasks in cross-domain dialogue modeling, such as dialogue state tracking, policy learning, user simulation, etc.
Our Summary
The researchers from Tsinghua University point out that the dialogs in the existing task-oriented datasets lack the smoothness of cross-domain transition compared to real-life human conversations. Moreover, there is still no well-recognized Chinese task-oriented dialog dataset. To address these issues, the authors introduce CrossWOZ, a large-scale Chinese multi-domain corpus for task-oriented dialog. The dataset contains 6K sessions and 102K utterances for 5 domains (attraction, restaurant, hotel, metro, and taxi) with natural and challenging cross-domain dependencies. The experiments demonstrate that cross-domain constraints in the CrossWOZ dataset are challenging for the existing models, implying that the introduced dataset is likely to enhance cross-domain dialog modeling.
What’s the core idea of this paper?
- Existing task-oriented dialog datasets are mostly far behind real-life conversations in terms of complexity and cross-domain transition:
- Humans usually transition naturally between different domains while still maintaining coherent contexts.
- Existing corpora, on the other hand, are either single-domain or cover multiple domains while modeling cross-domain dependency simply by imposing the same pre-specified constraints on different domains (as in the MultiWOZ dataset).
- To address this issue, as well as the lack of Chinese-language task-oriented dialog datasets, the authors introduce CrossWOZ, a large-scale Chinese multi-domain dataset for task-oriented dialog:
- It consists of 6K sessions and 102K utterances for 5 domains (attraction, restaurant, hotel, metro, and taxi).
- A choice in one domain affects the choices in related domains (e.g., the hotel should be close to the attraction chosen by the user in previous turns; see the sketch after this list).
- Dialog states and dialog acts are annotated for both the system side and the user side. Annotation of user states allows for the development of more elaborate user simulators.
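Here is an illustrative sketch of a cross-domain user goal of the kind described above. The field names and the `"<domain.slot>"` placeholder syntax are hypothetical stand-ins, not the dataset's exact annotation format; see the released corpus for details.

```python
# Illustrative sketch of a CrossWOZ-style cross-domain user goal: a choice
# made in one domain constrains the choices in later domains.
user_goal = [
    {"domain": "attraction",
     "constraints": {"fee": "free"},
     "requests": ["name", "address"]},
    {"domain": "hotel",
     # Cross-domain dependency: this value is resolved at dialog time to the
     # attraction the user actually selected in the previous sub-goal.
     "constraints": {"near": "<attraction.name>", "rating": "4+"},
     "requests": ["name", "price"]},
    {"domain": "taxi",
     "constraints": {"from": "<attraction.name>", "to": "<hotel.name>"},
     "requests": ["car_type"]},
]
```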
What’s the key achievement?
- Introducing the first large-scale Chinese cross-domain dataset for task-oriented dialog systems.
- Providing benchmark models for different tasks, including natural language understanding, dialog state tracking, dialog policy learning, and natural language generation.
- Providing a user simulator to facilitate the development and evaluation of dialog models on the introduced dataset.
What does the AI community think?
- The paper was accepted by TACL and presented at ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- The CrossWOZ dataset can enhance the development of multi-domain task-oriented chatbots.
Where can you get a dataset?
- The dataset and the benchmark models are released on GitHub.