Conversational AI is becoming an integral part of business practice across industries. More companies are adopting the advantages chatbots bring to customer service, sales, and marketing.
Even though chatbots are becoming a “must-have” asset for leading businesses, their performance still falls far short of human-level. Researchers from major research institutions and tech leaders have explored ways to boost the performance of dialog systems by increasing the diversity of their responses, enabling emotion recognition, improving their ability to track long-term aspects of the conversation, ensuring that they maintain a consistent persona, and more.
We’ve searched through the important conversational AI research papers published in 2019 to present the top 10 that set a new state of the art in both task-oriented and open-domain dialog systems.
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
If you’d like to skip around, here are the papers we featured:
- Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems
- Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study
- Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good
- OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs
- A Dynamic Speaker Model for Conversational Interactions
- Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems
- Jointly Optimizing Diversity and Relevance in Neural Response Generation
- Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack
- Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset
- Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations
10 Important Conversational AI Research Papers of 2019
1. Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems, by Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, Pascale Fung
Original Abstract
Over-dependence on domain ontology and lack of knowledge sharing across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short in tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using a copy mechanism, facilitating knowledge transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art joint goal accuracy of 48.62% for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show its transferring ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains.
Our Summary
The research team from the Hong Kong University of Science and Technology and Salesforce Research addresses the problem of over-dependence on domain ontology and lack of knowledge sharing across domains. In a practical scenario, many slots share all or some of their values among different domains (e.g., the area slot can exist in many domains like restaurant, hotel, or taxi), so transferring knowledge across multiple domains is imperative for dialogue state tracking (DST) models. The researchers introduce a TRAnsferable Dialogue statE generator (TRADE) that leverages its context-enhanced slot gate and copy mechanism to track slot values mentioned anywhere in a dialogue history. TRADE shares its parameters across domains and doesn’t require a predefined ontology, which enables tracking of previously unseen slot values. The experiments demonstrate the effectiveness of this approach, with TRADE achieving state-of-the-art joint goal accuracy of 48.62% on the challenging MultiWOZ dataset.
What’s the core idea of this paper?
- To overcome over-dependence on domain ontology and lack of knowledge sharing across domains, the researchers suggest:
- generating slot values directly instead of predicting the probability of every predefined ontology term;
- sharing all the model parameters across domains.
- The TRADE model consists of three components (see the code sketch after this list):
- an utterance encoder to encode dialogue utterances into a sequence of fixed-length vectors;
- a slot gate to predict whether a certain (domain, slot) pair is triggered by the dialogue;
- a state generator to decode multiple output tokens for all (domain, slot) pairs independently to predict their corresponding values.
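To make the interplay between these components concrete, here is a minimal PyTorch sketch of a copy-augmented state generator with a slot gate. It is an illustration under assumed dimensions and simplifications, not the authors’ exact architecture; in the full model, the copy distribution is scattered back onto vocabulary ids before mixing.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TradeSketch(nn.Module):
    """Illustrative sketch of TRADE's shared components: an utterance
    encoder, a slot gate, and a state generator with a copy mechanism.
    Dimensions and details are assumptions, not the paper's exact model."""

    def __init__(self, vocab_size, hidden=400, n_gates=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)  # utterance encoder
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)  # state generator
        self.gate = nn.Linear(hidden, n_gates)                   # slot gate: ptr / none / dontcare
        self.p_gen = nn.Linear(hidden, 1)                        # soft generate-vs-copy switch

    def forward(self, history_ids, slot_emb, max_len=5):
        enc_out, h = self.encoder(self.embed(history_ids))       # encode the dialog history
        inp, steps = slot_emb.unsqueeze(1), []
        for _ in range(max_len):
            dec_out, h = self.decoder(inp, h)
            vocab_dist = F.softmax(dec_out @ self.embed.weight.T, dim=-1)     # generate from vocab
            copy_dist = F.softmax(dec_out @ enc_out.transpose(1, 2), dim=-1)  # point into history
            mix = torch.sigmoid(self.p_gen(dec_out))
            steps.append((mix, vocab_dist, copy_dist))  # full model scatters copy_dist onto vocab ids
            inp = dec_out
        gate_logits = self.gate(dec_out.squeeze(1))     # is this (domain, slot) pair triggered?
        return gate_logits, steps
```

Because the parameters are shared across domains and values are generated rather than picked from an ontology list, the same decoder can produce values for (domain, slot) pairs it never saw during training.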
What’s the key achievement?
- On the challenging MultiWOZ dataset of human-human dialogues, TRADE achieves joint goal accuracy of 48.62%, setting a new state of the art (the metric is sketched below).
- Moreover, TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, demonstrating its ability to transfer knowledge to previously unseen domains.
- The experiments also demonstrate the model’s ability to adapt to new few-shot domains without forgetting already trained domains.
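For reference, joint goal accuracy, the headline metric above, counts a turn as correct only when the entire predicted belief state matches the gold annotation. A minimal sketch, assuming belief states are represented as collections of (domain, slot, value) triples:

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of dialogue turns whose full predicted belief state
    (the set of (domain, slot, value) triples) exactly matches the gold."""
    matches = sum(set(p) == set(g) for p, g in zip(predicted_states, gold_states))
    return matches / len(gold_states)

# Example: one of two turns matches exactly, so the score is 0.5.
print(joint_goal_accuracy(
    [[("hotel", "area", "north")], [("taxi", "leave at", "9:00")]],
    [[("hotel", "area", "north")], [("taxi", "leave at", "9:15")]],
))
```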
What does the AI community think?
- The paper received an Outstanding Paper award at the main ACL 2019 conference and the Best Paper Award at the NLP for Conversational AI workshop at the same conference.
What are future research areas?
- Transferring knowledge from other resources to further improve zero-shot performance.
- Collecting a dataset with a large number of domains to facilitate the study of techniques within multi-domain dialogue state tracking.
What are possible business applications?
- The current research can significantly improve the performance of task-oriented dialogue systems in multi-domain settings.
Where can you get implementation code?
- The PyTorch implementation of this study is available on GitHub.
2. Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study, by Chinnadhurai Sankar, Sandeep Subramanian, Chris Pal, Sarath Chandar, and Yoshua Bengio
Original Abstract
Neural generative models have become increasingly popular when building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures like recurrent and transformer-based seq2seq models are rarely sensitive to most perturbations such as missing or reordering utterances, shuffling words, etc. Also, by open-sourcing our code, we believe that it will serve as a useful diagnostic tool for evaluating dialog systems in the future.
Our Summary
Neural generative models are good at producing fluent responses, but these responses tend to be boring and repetitive, which is often attributed to the models’ poor understanding of dialog history. In this paper, the authors investigate empirically whether neural generative models use dialog history effectively. To this end, they test the sensitivity of recurrent and transformer-based sequence-to-sequence models to a variety of synthetic perturbations. The approach is based on the assumption that if a model is insensitive to perturbations that destroy some type of information, then it makes minimal use of that information. The experiments demonstrate that neural generative models are largely insensitive to both utterance-level and word-level perturbations.
What’s the core idea of this paper?
- Studying empirically the behavior of generative neural systems in the presence of synthetically introduced perturbations to the dialog history.
- Experimenting with two families of perturbations (mirrored in the code sketch after this list):
- utterance-level perturbations (shuffling the sequence of utterances, reversing the order of utterances, dropping certain utterances, truncating the dialog history to contain only the k most recent utterances);
- word-level perturbations (word-shuffling, reversing the ordering of words, dropping 30% of words uniformly, dropping all nouns, dropping all verbs).
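The perturbations themselves are simple string manipulations. A minimal sketch in the spirit of the paper’s setup (the exact implementations in the released code may differ):

```python
import random

def shuffle_utterances(history):
    """Utterance-level perturbation: randomly reorder the dialog turns."""
    return random.sample(history, len(history))

def truncate_history(history, k):
    """Keep only the k most recent utterances."""
    return history[-k:]

def shuffle_words(utterance):
    """Word-level perturbation: destroy word order within an utterance."""
    words = utterance.split()
    return " ".join(random.sample(words, len(words)))

def drop_words(utterance, p=0.3):
    """Drop each word independently with probability p (30% in the paper)."""
    return " ".join(w for w in utterance.split() if random.random() > p)

# A model whose perplexity barely changes under these perturbations is,
# by the paper's argument, barely using the destroyed information.
history = ["hi there!", "hello, how can I help?", "book me a table for two"]
print(shuffle_utterances(history))
print(drop_words(history[-1]))
```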
What’s the key achievement?
- Demonstrating empirically that:
- Models tend to show only tiny changes in perplexity even under extreme changes to the dialog history, suggesting that they don’t use the information available to them effectively.
- Transformers are insensitive to word-reordering, implying that they could be learning bag-of-words-like representations.
- The attention mechanisms result in models using more information from the earlier parts of the dialog.
- Compared to recurrent models, transformers are less sensitive to perturbations that destroy conversational dynamics across utterances.
What does the AI community think?
- The paper has been nominated as a candidate for the ACL 2019 Best Paper Award.
What are future research areas?
- Model sensitivity to perturbations destroying certain types of information within the dialog history can be useful:
- for understanding the kinds of information leveraged by models to solve new dialog datasets;
- as a diagnostic tool for new models.
Where can you get implementation code?
- The code is available on GitHub.
3. Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good, by Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, and Zhou Yu
Original Abstract
Developing intelligent persuasive conversational agents to change people’s opinions and actions for social good is the frontier in advancing the ethical development of automated dialogue systems. To do so, the first step is to understand the intricate organization of strategic disclosures and appeals employed in human persuasion conversations. We designed an online persuasion task where one participant was asked to persuade the other to donate to a specific charity. We collected a large dataset with 1,017 dialogues and annotated emerging persuasion strategies from a subset. Based on the annotation, we built a baseline classifier with context information and sentence-level features to predict the 10 persuasion strategies used in the corpus. Furthermore, to develop an understanding of personalized persuasion processes, we analyzed the relationships between individuals’ demographic and psychological backgrounds including personality, morality, value systems, and their willingness for donation. Then, we analyzed which types of persuasion strategies led to a greater amount of donation depending on the individuals’ personal backgrounds. This work lays the ground for developing a personalized persuasive dialogue system.
Our Summary
The paper builds on the Elaboration Likelihood Model, which argues that persuasive messages are more effective when they are tailored to people’s worldviews. The authors recruited participants from Mechanical Turk, psychologically profiled them, and then asked them to role-play persuading each other to donate to the charity Save the Children. The researchers then annotated a subset of the conversations according to which persuasion strategies were used, and built a hybrid RCNN classifier to classify the whole corpus into different persuasion strategies. They analyzed the most successful strategies overall and the interactions between persuasion strategies and the users’ demographics and personality types.
What’s the core idea of this paper?
- The purpose of AI conversational agents often involves persuasion in some form, but research into effective persuasion strategies for them has been limited because the study of persuasion is traditionally part of social science rather than AI engineering.
- This interdisciplinary paper builds on sociological foundations to identify different persuasion strategies in a large corpus of human chat conversations and analyzes which ones are most effective for people in general and for given personality types.
- To classify the persuasion strategies into 10 categories plus one additional “non-strategy” class, the authors developed a hybrid RCNN model with the following features (combined as in the sketch below):
- sentence embedding;
- context embedding;
- sentence-level features.
Overview of the hybrid RCNN classifier
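A minimal sketch of the feature-combination step: the three feature types are concatenated and fed to a classifier over 11 classes (10 strategies plus the “non-strategy” class). The dimensions and layer choices here are assumptions, not the paper’s exact RCNN:

```python
import torch
import torch.nn as nn

class HybridClassifierSketch(nn.Module):
    """Illustrative hybrid classifier: concatenate a sentence embedding,
    a context embedding, and hand-crafted sentence-level features, then
    predict one of 11 classes. Dimensions are hypothetical."""

    def __init__(self, sent_dim=300, ctx_dim=300, feat_dim=20, n_classes=11):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(sent_dim + ctx_dim + feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, sent_emb, ctx_emb, sent_feats):
        x = torch.cat([sent_emb, ctx_emb, sent_feats], dim=-1)
        return self.classifier(x)  # logits over strategy classes
```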
What’s the key achievement?
- The researchers have discovered that:
- Offering practical donation information was the best strategy overall.
- Asking the user if they’re familiar with the charity significantly increased the donation probability for participants high in the Big Five trait of Openness.
- Asking personal questions significantly increased the donation probability for users who subscribe to Care or Freedom (in Haidt’s moral foundations theory), but reduced the donation probability for those who subscribe to Fairness or Authority.
What does the AI community think?
- The paper has been nominated as a candidate for the ACL 2019 Best Paper Award.
What are future research areas?
- Improving the performance of the classifier by including more annotations and more dialog context.
- Designing an adaptive persuasive dialog system with the ability to choose persuasion strategies based on the user’s profile.
What are possible business applications?
- The authors note that technology is double-edged and has the potential to persuade people for good or evil. Bearing this in mind, the knowledge presented in the paper could be used to inform ethical design foundations for automated dialog systems.
- As in the paper’s example, the approach can be used by charities to persuade potential supporters to donate.
- It can also be used in products aimed at helping people fulfill their personal goals.
Where can you get implementation code?
- The authors have released the dataset and code at GitLab.
4. OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs, by Seungwhan Moon, Pararth Shah, Anuj Kumar, and Rajen Subba
Original Abstract
We study a conversational reasoning model that strategically traverses through a large-scale common fact knowledge graph (KG) to introduce engaging and contextually diverse entities and attributes. For this study, we collect a new Open-ended Dialog ↔ KG parallel corpus called OpenDialKG, where each utterance from 15K human-to-human roleplaying dialogs is manually annotated with ground-truth reference to corresponding entities and paths from a large-scale KG with 1M+ facts. We then propose the DialKG Walker model that learns the symbolic transitions of dialog contexts as structured traversals over KG, and predicts natural entities to introduce given previous dialog contexts via a novel domain-agnostic, attention-based graph path decoder. Automatic and human evaluations show that our model can retrieve more natural and human-like responses than the state-of-the-art baselines or rule-based models, in both in-domain and cross-domain tasks. The proposed model also generates a KG walk path for each entity retrieved, providing a natural way to explain conversational reasoning.
Our Summary
The Facebook Conversational AI team introduces a novel approach to creating natural, human-like responses with engaging, contextually diverse information about different entities and their attributes. First, they collect a new parallel corpus, OpenDialKG, where each mention of an entity in a conversation is manually linked with its corresponding ground-truth KG path. Then, the authors introduce a new model called DialKG Walker that learns knowledge paths among entities mentioned in the conversation and reasons over a large commonsense knowledge graph. The experiments on several benchmarks demonstrate that the suggested approach generates more natural and human-like responses than the state-of-the-art baselines.
What’s the core idea of this paper?
- The research team wants to enable open-ended dialog systems to understand conversational contexts and respond by introducing relevant entities and attributes.
- For this purpose, they use a large-scale knowledge graph (100K entities and 1.1M facts) and a data-driven reasoning model that maps dialog transitions to KG paths to identify a subset of entities that will be relevant to mention within a specific dialog context.
- Specifically, the research team proposes a new model, called DialKG Walker (a simplified walk step is sketched after this list), that includes:
- an attention-based graph decoder that walks an optimal path within a large commonsense KG to effectively prune unlikely candidate entities;
- a parallel zero-shot learning model that leverages previous sentence, dialog, and KG context to rank candidate entities based on their relevance and path scores.
- With the large-scale common fact knowledge graph, the introduced approach enables domain-agnostic conversational reasoning in open-ended conversations across various domains and tasks.
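To illustrate the core idea of an attention-based walk (this is a simplified stand-in, not the authors’ decoder): at each hop, the outgoing (relation, entity) edges of the current node are scored against the dialog context, and the walk follows the most likely edge.

```python
import torch
import torch.nn.functional as F

def walk_step(context_vec, edge_embs, neighbor_ids):
    """One hop of an attention-based graph walk (illustrative).
    context_vec: (H,) dialog-context vector; edge_embs: (E, H) embeddings
    of the outgoing (relation, entity) edges; neighbor_ids: (E,) target nodes."""
    attn = F.softmax(edge_embs @ context_vec, dim=0)  # attention over outgoing edges
    best = int(torch.argmax(attn))
    return neighbor_ids[best], attn

# Chaining a few walk_step calls yields a KG path: it both selects the
# next entity to mention and doubles as an explanation of the choice.
```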
What’s the key achievement?
- Automatic and human evaluations demonstrate that DialKG Walker outperforms the state-of-the-art baselines and rule-based models.
- Extensive cross-domain and transfer learning evaluations confirm the flexibility of the presented approach.
- The introduced OpenDialKG dataset with 91K utterances across 15K dialog sessions allows for an in-depth study of symbolic reasoning and natural language conversations.
What does the AI community think?
- The paper has been nominated as a candidate for the ACL 2019 Best Paper Award.
What are possible business applications?
- Enabling chatbots across different domains and tasks (e.g., chitchat, recommendations) to generate more natural and engaging responses by mentioning relevant entities and attributes.
5. A Dynamic Speaker Model for Conversational Interactions, by Hao Cheng, Hao Fang, Mari Ostendorf
Original Abstract
Individual differences in speakers are reflected in their language use as well as in their interests and opinions. Characterizing these differences can be useful in human-computer interaction, as well as analysis of human-human conversations. In this work, we introduce a neural model for learning a dynamically updated speaker embedding in a conversational context. Initial model training is unsupervised, using context-sensitive language generation as an objective, with the context being the conversation history. Further fine-tuning can leverage task-dependent supervised training. The learned neural representation of speakers is shown to be useful for content ranking in a socialbot and dialog act prediction in human-human conversations.
Our Summary
The research team from the University of Washington suggests that learning individual differences between users can be useful for predicting their next dialog acts. To detect these individual differences, the researchers propose an unsupervised neural model that learns a speaker embedding from the dialog history. Furthermore, to capture changes over time and improve the speaker representation as new data becomes available, the model is structured to allow dynamic updates of the speaker embedding vector at each dialog turn. The empirical results demonstrate that the model with dynamic speaker embeddings outperforms the baselines in predicting user topic decisions in human-socialbot conversations and classifying dialog acts in human-human dialogs.
The dynamic speaker model
What’s the core idea of this paper?
- The authors introduce a model for learning a neural representation of speakers:
- The model is trained in an unsupervised manner to learn a representation of each speaker by relying only on that speaker’s conversation history.
- A learnable component for analyzing the latent modes of the speaker is incorporated into the model to help align the learned speaker features with human-interpretable characteristics.
- The Dynamic Speaker Model consists of three components (see the code sketch after this list):
- a latent mode analyzer to read an utterance and analyze its latent modes;
- a speaker state tracker to accumulate speaker information during the conversation;
- a speaker language predictor to reconstruct the utterance using the corresponding speaker state.
- The learned dynamic speaker embeddings can be directly used as features or fine-tuned for a particular downstream task.
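A minimal sketch of how the three components could fit together; the dimensions and layer choices are assumptions, not the authors’ exact model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSpeakerSketch(nn.Module):
    """Illustrative sketch: a latent mode analyzer, a speaker state
    tracker updated at every turn, and a language predictor that
    reconstructs the utterance from the current speaker state."""

    def __init__(self, utt_dim=300, n_modes=10, state_dim=128, vocab_size=10000):
        super().__init__()
        self.mode_scores = nn.Linear(utt_dim, n_modes)       # latent mode analyzer
        self.mode_embs = nn.Embedding(n_modes, state_dim)
        self.tracker = nn.GRUCell(state_dim, state_dim)      # speaker state tracker
        self.predictor = nn.Linear(state_dim, vocab_size)    # speaker language predictor

    def step(self, utt_emb, speaker_state):
        modes = F.softmax(self.mode_scores(utt_emb), dim=-1)  # soft mode assignment
        mode_vec = modes @ self.mode_embs.weight              # mixture of mode embeddings
        new_state = self.tracker(mode_vec, speaker_state)     # dynamic per-turn update
        logits = self.predictor(new_state)                    # used to reconstruct the utterance
        return new_state, logits
```

The returned state acts as the dynamic speaker embedding: refreshed at every turn, usable directly as a feature or fine-tuned downstream.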
What’s the key achievement?
- The introduced model achieves promising results in such downstream tasks as:
- user topic decision prediction in human-socialbot conversations;
- dialog act classification in human-human conversations.
- The analysis of the learned latent modes shows that the model captures such speaker characteristics as intent, speaking style, and gender.
What does the AI community think?
- The paper was presented at NAACL-HLT 2019, one of the most important conferences in natural language processing.
What are future research areas?
- Guiding latent modes with some examples to select particular personality traits.
What are possible business applications?
- Learning individual differences between speakers, as suggested in this paper, can be incorporated into different NLP systems to improve their performance in language understanding, language generation, human-chatbot interaction, query completion, etc.
Where can you get implementation code?
- The authors provide the implementation code on GitHub.
6. Approximating Interactive Human Evaluation with Self-Play for Open-Domain Dialog Systems, by Asma Ghandeharioun, Judy Hanwen Shen, Natasha Jaques, Craig Ferguson, Noah Jones, Agata Lapedriza, Rosalind Picard
Original Abstract
Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of single-turn evaluation, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to-date, achieving a significant Pearson correlation (r>.7, p<.05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and one-turn evaluation, we perform extended experiments with a set of models, including several that make novel improvements to recent hierarchical dialog generation architectures through sentiment and semantic knowledge distillation on the utterance level. Finally, we open-source the interactive evaluation platform we built and the dataset we collected to allow researchers to efficiently deploy and evaluate generative dialog models.
Our Summary
The MIT research team investigates the problem of evaluating open-domain dialog systems. Humans are the ultimate authority in evaluating the quality of dialog systems, but getting human ratings is usually quite an expensive and difficult process. To overcome these challenges and still get reliable evaluations, the MIT team introduces a novel framework to estimate a dialog quality score, which has a high and statistically significant correlation with human ratings. Specifically, they propose a series of psychology-motivated metrics and then fit a function to predict human evaluation of conversation quality given these metrics. Bot quality is evaluated through self-play over a fixed number of turns, in which the bot generates utterances that are fed back as input in the next turn. The experiments confirm that the introduced self-play framework, together with psychology-motivated automated metrics, provides a good proxy for conversation assessment.
What’s the core idea of this paper?
- The researchers demonstrate that single-turn evaluation doesn’t capture common failures of open-domain dialog systems, including lack of diversity in responses, failure to track long-term aspects of the conversation, and inability to maintain a consistent persona.
- To ensure interactive multi-turn evaluation, they propose a series of automated metrics, including sentiment, semantic, and engagement metrics.
- These metrics are computed on the conversation of the bot with itself, then combined using linear regression and optimized to best predict human assessment of conversation quality (see the sketch below).
- The authors also suggest improvements to several hierarchical seq2seq generative models by regularizing the top level of the hierarchy to ensure it encodes the sentiment and semantics of the conversation.
Illustration of suggested regularization (blue) applied to the Variational Hierarchical Recurrent Encoder Decoder baseline (red)
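A minimal sketch of the overall recipe, with hypothetical proxy metrics standing in for the paper’s sentiment, semantic, and engagement metrics: compute proxies over a self-play conversation, then fit a linear model to human quality ratings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def selfplay_metrics(turns, sentiment_fn, embed_fn):
    """Compute simple proxy metrics over one self-play conversation.
    sentiment_fn and embed_fn are placeholders for a sentiment scorer
    and a sentence encoder."""
    sentiments = [sentiment_fn(t) for t in turns]
    embs = [embed_fn(t) for t in turns]
    coherence = np.mean([  # cosine similarity between consecutive turns
        np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        for a, b in zip(embs, embs[1:])
    ])
    engagement = np.mean([len(t.split()) for t in turns])  # crude length proxy
    return [np.mean(sentiments), coherence, engagement]

def fit_quality_model(proxy_rows, human_scores):
    """Fit a linear combination of the proxies to human ratings; the fitted
    model can then score new bots from self-play alone."""
    return LinearRegression().fit(np.array(proxy_rows), np.array(human_scores))
```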
What’s the key achievement?
- Introducing a metric that, in a self-play framework, provides results that are strongly correlated with human assessments with regard to:
- bot empathy (r>0.8);
- conversation quality (r>0.7).
What does the AI community think?
- The paper was accepted at NeurIPS 2019, the leading conference in artificial intelligence.
What are possible business applications?
- Following this research paper, the MIT team provides an interactive evaluation platform to help researchers and practitioners evaluate and further improve their dialog systems.
Where can you get implementation code?
- The PyTorch implementation of all the models mentioned in the paper is provided on GitHub.
- To interact with the models, go to http://neural.chat.
- The Reddit dataset with 109K conversations that was used in the research is also publicly available.
7. Jointly Optimizing Diversity and Relevance in Neural Response Generation, by Xiang Gao, Sungjin Lee, Yizhe Zhang, Chris Brockett, Michel Galley, Jianfeng Gao, Bill Dolan
Original Abstract
Although recent neural conversation models have shown great potential, they often generate bland and generic responses. While various approaches have been explored to diversify the output of the conversation model, the improvement often comes at the cost of decreased relevance. In this paper, we propose a SpaceFusion model to jointly optimize diversity and relevance that essentially fuses the latent space of a sequence-to-sequence model and that of an autoencoder model by leveraging novel regularization terms. As a result, our approach induces a latent space in which the distance and direction from the predicted response vector roughly match the relevance and diversity, respectively. This property also lends itself well to an intuitive visualization of the latent space. Both automatic and human evaluation results demonstrate that the proposed approach brings significant improvement compared to strong baselines in both diversity and relevance.
Our Summary
The Microsoft Research team addresses diversity in chatbots’ responses. In particular, they introduce an approach to improving the diversity of responses without decreasing their relevance. Their SpaceFusion model leverages a sequence-to-sequence model for producing a predicted response vector and an autoencoder model for generating vectors for potential responses. By using the same decoder for both models and training them jointly end-to-end with novel regularization terms, the researchers effectively fuse the two models’ latent spaces into a single structured latent space, where the relevance and diversity of a response can be controlled by adjusting its distance and direction from the predicted response vector, respectively. Both automated metrics and human evaluations confirm the effectiveness of the proposed approach in improving the diversity and relevance of the responses.
Distance and direction from the predicted response vector given the context roughly match the relevance and diversity, respectively
What’s the core idea of this paper?
- The paper introduces the SpaceFusion model, which is aimed at generating diverse and relevant responses based on a context:
- The sequence-to-sequence model produces the predicted response vector (the black dot in the figure above).
- The autoencoder generates the vectors for potential responses (the colored dots in the figure).
- These models share the latent space because they use the same decoder and are trained jointly end-to-end with regularization.
- Regularization is necessary for aligning the latent spaces (illustrated in the sketch after this list):
- The interpolation regularization term prevents semantically different responses from aligning in the same direction.
- The fusion regularization term ensures homogeneous distribution of the vectors produced by the two models.
- In the induced latent space, the diversity of responses can be regulated by varying the direction, and the relevance of responses can be controlled by adjusting the distance from the predicted response vector.
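A sketch of the two regularizers’ intent; the exact loss terms in the paper differ in form, and these are simplified stand-ins:

```python
import torch

def interpolation_points(z_s2s, z_ae, n=4):
    """Points on the segment between the predicted-response vector (from
    the seq2seq model) and a response vector (from the autoencoder); the
    interpolation regularizer trains the shared decoder to reconstruct
    the response from each such point, smoothing the path between them."""
    u = torch.rand(n, 1)
    return u * z_s2s + (1 - u) * z_ae  # z_s2s, z_ae: (H,) vectors

def fusion_penalty(z_s2s_batch, z_ae_batch):
    """Simplified stand-in for the fusion regularizer: encourage the two
    latent clouds to occupy the same region of the space."""
    return torch.norm(z_s2s_batch.mean(dim=0) - z_ae_batch.mean(dim=0))
```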
What’s the key achievement?
- According to the automatic evaluations, the SpaceFusion model significantly outperforms strong baselines in terms of precision, recall, and F1 score.
- According to the human evaluations, the proposed model generates responses that are more relevant and interesting than those of the baseline systems, though still less so than responses written by humans.
- The properties of the SpaceFusion model enable intuitive visualization of the latent space.
What does the AI community think?
- The paper was presented at NAACL-HLT 2019, one of the key conferences in natural language processing.
- The follow-up work of the authors on stylized response generation (StyleFusion) was accepted to EMNLP 2019, another important conference in natural language processing.
What are future research areas?
- Providing a theoretical justification for the effectiveness of the proposed regularization terms.
- Generalizing the introduced model to generate stylized responses.
What are possible business applications?
- The SpaceFusion model can boost the performance of chatbots by enabling the generation of more diverse and still relevant responses.
Where can you get implementation code?
- The implementation of the SpaceFusion model is available on GitHub.
8. Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack, by Emily Dinan, Samuel Humeau, Bharath Chintagunta, Jason Weston
Original Abstract
The detection of offensive language in the context of a dialogue has become an increasingly important application of natural language processing. The detection of trolls in public forums, and the deployment of chatbots in the public domain are two examples that show the necessity of guarding against adversarially offensive behavior on the part of humans. In this work, we develop a training scheme for a model to become robust to such human attacks by an iterative build it, break it, fix it strategy with humans and models in the loop. In detailed experiments we show this approach is considerably more robust than previous systems. Further, we show that offensive language used within a conversation critically depends on the dialogue context, and cannot be viewed as a single sentence offensive detection task as in most previous work. Our newly collected tasks and methods will be made open source and publicly available.
Our Summary
Adversarial attacks on Microsoft’s Tay chatbot caused its language generation component to emit offensive language. The dialog system behind Tay was not able to withstand these attacks, and it was eventually shut down. In this paper, the Facebook AI research team studies the detection of offensive language in conversation using models that are robust to adversarial human attacks. Specifically, they apply an iterative “build it, break it, fix it” strategy with humans and models in the loop: crowdworkers try to break the current model, the model is fixed by retraining on the newly collected adversarial examples, and the sequence is repeated over a number of iterations. The experiments demonstrate the robustness of this approach compared to existing systems.
What’s the core idea of this paper?
- To build models that are robust to adversarial behavior, the Facebook AI research team suggests the following algorithm (sketched in code after this list):
- Build it: build a model that can detect offensive comments. The authors used the BERT-based model trained on the Wikipedia Toxic Comments dataset.
- Break it: ask the crowdworkers to submit messages that the worker finds offensive but the system marks as safe.
- Fix it: re-train the model on these new examples.
- Repeat: repeat the break it and fix it phases several times.
- To make the model more robust to adversarial attacks, the authors suggest focusing on offensive utterances in the dialog context instead of classifying single utterances, because a phrase can be totally innocent on its own but extremely offensive in the context of dialog history.
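In pseudocode form, the loop looks roughly like this; `train_fn` and `collect_adversarial_fn` are placeholders for the model-training and crowdsourcing steps:

```python
def build_break_fix(train_fn, collect_adversarial_fn, data, rounds=3):
    """Sketch of the iterative loop: train a classifier, ask crowdworkers
    to 'break' it with offensive messages it marks as safe, add those
    examples to the training set, and retrain."""
    model = train_fn(data)                      # Build it
    for _ in range(rounds):
        broken = collect_adversarial_fn(model)  # Break it: examples that fool the model
        data = data + broken                    # Fix it: augment the training data
        model = train_fn(data)                  #         and retrain
    return model
```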
What’s the key achievement?
- The evaluation of the presented build it, break it, fix it approach demonstrates that:
- it is considerably more robust to adversarial attacks than existing systems;
- the adversarial data collected with this approach has far more nuanced language than existing datasets, includes less profanity, and is instead offensive because of figurative language, negation, and applying world knowledge;
- using contextual information from the dialog history to identify offensive language is critical for making the system robust to adversarial human attacks.
What does the AI community think?
- The paper was accepted for oral presentation at EMNLP 2019, one of the key conferences in natural language processing.
What are future research areas?
- Going beyond the binary problem definition (safe or offensive) and considering classes of offensive language separately.
- Exploring other dialog settings, such as social media and forums.
- Investigating how the introduced approach can make neural generative models safe.
What are possible business applications?
- The presented approach can be incorporated into different dialogue systems to make them more robust to adversarial human attacks.
Where can you get implementation code?
- The authors are going to open-source the code for the entire build it, break it, fix it algorithm, as well as release the collected data and trained models. However, these resources were not available at the time of this writing.
9. Taskmaster-1: Toward a Realistic and Diverse Dialog Dataset, by Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Andy Cedilnik, Kyu-Young Kim
Original Abstract
A significant barrier to progress in data-driven approaches to building dialog systems is the lack of high quality, goal-oriented conversational data. To help satisfy this elementary requirement, we introduce the initial release of the Taskmaster-1 dataset which includes 13,215 task-based dialogs comprising six domains. Two procedures were used to create this collection, each with unique advantages. The first involves a two-person, spoken “Wizard of Oz” (WOz) approach in which trained agents and crowdsourced workers interact to complete the task while the second is “self-dialog” in which crowdsourced workers write the entire dialog themselves. We do not restrict the workers to detailed scripts or to a small knowledge base and hence we observe that our dataset contains more realistic and diverse conversations in comparison to existing datasets. We offer several baseline models including state of the art neural seq2seq architectures with benchmark performance as well as qualitative human evaluations. Dialogs are labeled with API calls and arguments, a simple and cost effective approach which avoids the requirement of complex annotation schema. The layer of abstraction between the dialog model and the service provider API allows for a given model to interact with multiple services that provide similar functionally. Finally, the dataset will evoke interest in written vs. spoken language, discourse patterns, error handling and other linguistic phenomena related to dialog system research, development and design.
Our Summary
In this paper, the Google AI research team addresses the lack of high-quality goal-oriented conversational data. To tackle this problem, they introduce Taskmaster-1, a dataset with 13,215 dialogs across six domains, collected with two distinct procedures. Some of the dialogs were collected through a web-based interface, where crowdsourced workers playing “users” communicated with human operators but were led to believe they were interacting with an automated system. The rest of the dialogs were written by crowdsourced workers based on suggested scenarios. Taskmaster-1 has richer, more diverse language and involves more real-world entities than MultiWOZ, the currently popular benchmark dataset.
What’s the core idea of this paper?
- The lack of high-quality goal-oriented dialog datasets is considered a major hindrance to significant progress in dialog generation and understanding.
- To address this need in the NLP community, the Google AI team presents the Taskmaster-1 dataset:
- The dataset consists of 13,215 dialogs:
- 5,507 spoken dialogs were collected through a Wizard of Oz system, where crowdsourced workers playing “users” interacted with human operators playing “digital assistants”. To imitate a real-world scenario, the users were led to believe they were interacting with an automated system, so they would phrase their turns naturally while still behaving as they would with a bot.
- 7,708 written dialogs were created by crowdsourced workers who wrote the full conversation themselves based on the outlined scenarios.
- The dataset covers six domains: ordering pizza, creating auto repair appointments, setting up a ride service, ordering movie tickets, ordering coffee drinks, and making restaurant reservations.
- The researchers used simple API calls and arguments to label only the variables required to execute the transaction (e.g., movie name, time, number of tickets), as in the hypothetical example below.
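For illustration, a Taskmaster-1-style annotation might look like the following; the field names are hypothetical, not the dataset’s exact schema:

```python
# A hypothetical annotation: only the variables needed to execute the
# transaction are labeled as API arguments (names are illustrative).
movie_ticket_call = {
    "api": "book_movie_tickets",
    "arguments": {
        "movie_name": "The Lion King",
        "time": "7:30 pm",
        "num_tickets": 2,
    },
}
```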
What’s the key achievement?
- Introducing a dataset with richer and more diverse language than currently available task-oriented benchmark datasets (e.g., MultiWOZ).
- Evaluating the performance of several strong baseline models on Taskmaster-1 using automatic evaluation metrics and qualitative human evaluations.
What does the AI community think?
- The paper was selected for oral presentation at EMNLP 2019, one of the most important conferences in natural language processing.
What are possible business applications?
- The introduced dataset can be used to boost the performance of goal-oriented chatbots and digital assistants.
Where can you get implementation code?
- The Taskmaster-1 dialogue dataset is available on the Google AI website.
10. Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations, by Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tur
Original Abstract
Building socialbots that can have deep, engaging open-domain conversations with humans is one of the grand challenges of artificial intelligence (AI). To this end, bots need to be able to leverage world knowledge spanning several domains effectively when conversing with humans who have their own world knowledge. Existing knowledge-grounded conversation datasets are primarily stylized with explicit roles for conversation partners. These datasets also do not explore depth or breadth of topical coverage with transitions in conversations. We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don’t have explicitly defined roles, to help further research in open-domain conversational AI. We also train several state-of-the-art encoder-decoder conversational models on Topical-Chat and perform automated and human evaluation for benchmarking.
Our Summary
To help build socialbots that can have deep, engaging conversations across multiple domains, the Amazon Alexa AI team introduces Topical-Chat, a knowledge-grounded dataset of ~11K human-human conversations spanning eight broad topics. The dataset was collected with the help of crowdworkers who were asked to have naturally coherent conversations grounded in the provided reading sets. The conversational partners were also asked to annotate their dialogs on several dimensions, such as reading-set utilization and sentiment. To create baselines for future research, the researchers trained several strong models on their dataset. The models demonstrated their ability to generate engaging responses grounded in a reading set and conditioned on dialog history.
What’s the core idea of this paper?
- The paper introduces Topical-Chat, a knowledge-grounded human-human conversation dataset containing over 235K utterances across eight broad topics: fashion, politics, books, sports, general entertainment, music, science & technology, and movies.
- The dataset was collected in partnership with Amazon Mechanical Turk workers in the following manner:
- The workers were provided with topical reading sets.
- Then, they were asked to hold natural and engaging conversations with each other based on the reading sets.
- To reflect real-world conversations, the conversational partners didn’t have any explicitly defined roles and the reading sets provided to them were symmetric or asymmetric to varying degrees.
What’s the key achievement?
- Introducing the largest social-conversation dataset available publicly to the research community.
- Providing several strong baselines for future research by training simple Transformer-based models for response generation and assessing them with automated metrics and human evaluations.
What are possible business applications?
- The Topical-Chat dataset can be leveraged to boost the performance of open-domain chatbots and help them hold coherent and natural conversations with humans about fashion, books, movies, etc.
Where can you get implementation code?
- The Topical-Chat dataset is available on GitHub.
If you like these research summaries, you might be also interested in the following articles:
- Top AI & Machine Learning Research Papers From 2019
- What Are Major NLP Achievements & Papers From 2019?
- 10 Cutting-Edge Research Papers In Computer Vision From 2019
- Top 12 AI Ethics Research Papers Introduced In 2019
- Breakthrough Research In Reinforcement Learning From 2019
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.