This research summary is part of our Conversational AI series which covers the latest AI & machine learning approaches in the following areas:
In this research summary, we feature some of the most interesting novel approaches to improving the performance of task-oriented, or goal-oriented, conversational agents. Task-oriented dialog systems assist users with completing certain tasks like hotel booking or weather queries, making them particularly valuable for real-world applications.
A good task-oriented dialog agent should be able to understand the user, request additional input from a user if necessary, provide all the information requested by a user, and complete a task successfully in a minimum number of turns. Modern chatbots are also expected to handle long dialog context and multiple domains (e.g., restaurant booking, hotel booking, flight schedules, etc.).
We’ve curated and summarized the recently introduced research papers that propose the most effective solutions for building state-of-the-art task-oriented dialog systems.
If these accessible AI research analyses & summaries are useful for you, you can subscribe to receive our regular industry updates below.
If you’d like to skip around, here are the papers we featured:
- Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models
- Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog
- Task-Oriented Conversation Generation Using Heterogeneous Memory Networks
- Attention-Informed Mixed-Language Training for Zero-shot Cross-lingual Task-oriented Dialogue Systems
- Modeling Long Context for Task-Oriented Dialogue State Generation
- Multi-Domain Dialogue Acts and Response Co-Generation
- Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking
Task-oriented Dialog Systems
1. Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models, by Tiancheng Zhao, Kaige Xie, Maxine Eskenazi
Original Abstract
Defining action spaces for conversational agents and optimizing their decision-making process with reinforcement learning is an enduring challenge. Common practice has been to use handcrafted dialog acts, or the output vocabulary, e.g. in neural encoder decoders, as the action spaces. Both have their own limitations. This paper proposes a novel latent action framework that treats the action spaces of an end-to-end dialog agent as latent variables and develops unsupervised methods in order to induce its own action space from the data. Comprehensive experiments are conducted examining both continuous and discrete action types and two different optimization methods based on stochastic variational inference. Results show that the proposed latent actions achieve superior empirical performance improvement over previous word-level policy gradient methods on both DealOrNoDeal and MultiWoz dialogs. Our detailed analysis also provides insights about various latent variable approaches for policy learning and can serve as a foundation for developing better latent actions in future research.
Our Summary
The research team from Carnegie Mellon University and Shanghai Jiao Tong University suggests rethinking action space for reinforcement learning (RL) in end-to-end (E2E) dialog agents. In contrast to common practices where hand-crafted dialog acts or the entire vocabulary are used as action spaces, the researchers propose to treat the action spaces of the end-to-end dialog agent as latent variables and leverage unsupervised techniques to induce action spaces from the data. Thus, their proposed framework, called Latent Action Reinforcement Learning (LaRL), overcomes the limitations of word-level RL, while getting the benefits of a traditional modular approach. The extensive evaluation demonstrates that LaRL significantly outperforms word-level policy gradient methods on both DealOrNoDeal and MultiWoz datasets.
What’s the core idea of this paper?
- When applying reinforcement learning to dialog agents, the common practice is to use either hand-crafted actions or the entire vocabulary (word-level RL) as action spaces. However, these approaches have a number of limitations:
- The approach with hand-crafted representations can handle only simple domains.
- The word-level RL:
- leads to degenerate behavior as the response decoder generates utterances that are incomprehensible;
- suffers from credit assignment over a long horizon, resulting in slow and suboptimal convergence.
- To overcome these limitations, the authors introduce a flexible latent variable dialog framework, called Latent Action Reinforcement Learning (LaRL). The core idea is to develop end-to-end models that can invent their own discourse-level actions, and then apply any RL technique to this induced action space.
- The paper also proposes:
- a novel training objective that outperforms the typical evidence lower bound;
- an attention mechanism for integrating discrete latent variables in the decoder for better modeling of long responses.
What’s the key achievement?
- Outperforming the previous state-of-the-art approaches on MultiWoz by 18.2% in terms of inform and success rates.
- Discovering novel and diverse negotiation strategies on the DealOrNoDeal dataset.
- Demonstrating that distinct latent actions are more suitable than continuous ones to serve as action spaces for reinforcement learning dialog agents.
What does the AI community think?
- The paper was presented at NAACL-HLT 2019, one of the leading conferences in natural language processing.
What are future research areas?
- Further exploring the effectiveness of practical latent variables in dialog agents with regards to creating action abstraction in an unsupervised manner.
What are possible business applications?
- The proposed approach can significantly enhance the performance of task-oriented dialog agents, enabling them to:
- give all the requested information;
- provide relevant recommendations;
- discover and apply novel negotiation strategies (e.g., elicit preference questions, request different offers, insist on key objects, etc.).
Where can you get implementation code?
- The data and code implementation are released on GitHub.
2. Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog, by Ryuichi Takanobu, Hanlin Zhu, Minlie Huang
Original Abstract
Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations. Many studies apply Reinforcement Learning to learn a dialog policy with the reward function which requires elaborate design and pre-specified user goals. With the growing needs to handle complex goals across multiple domains, such manually designed reward functions are not affordable to deal with the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal in the dialog sessions. The reward estimator evaluates the state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
Our Summary
Specifying an effective reward function is challenging in task-oriented dialog, since a good reward function guides the policy dynamically to complete the task during the conversation instead of only evaluating the task success at the end of a session. Therefore, the researchers from Tsinghua University introduce a novel algorithm, called Guided Dialog Policy Learning (GDPL), which leverages Inverse Reinforcement Learning to estimate the reward. They also integrate Adversarial Learning into the method to train the policy and reward estimator simultaneously in an alternating way. The experiments on a multi-domain dialog corpus, MultiWoz, demonstrate that GDPL outperforms the state-of-the-art baselines in terms of the task success rate.
What’s the core idea of this paper?
- The key idea is to replace the hand-crafted reward function that evaluates the task success at the end of the session with a reward function that can guide the policy dynamically to complete the task during the conversation.
- To this end, they introduce a Guided Dialog Policy Learning (GDPL) algorithm:
- A reward estimator is built via Inverse Reinforcement Learning (i.e., the reward is estimated based on human behaviors and interactions derived from real human–human dialog sessions).
- The policy and estimator are trained simultaneously thanks to integrating Adversarial Learning (i.e., the policy and estimator take turns to improve each other during training).
- The policy is guided dynamically throughout the conversation as the reward estimator evaluates the generated dialog session using state-action pairs instead of the entire session.
What’s the key achievement?
- The evaluation on the MultiWoz benchmark demonstrates that GDPL:
- outperforms state-of-the-art baselines in terms of:
- inform F1 rate, which evaluates whether the user has been informed of all the requested information (94.97 vs. 83.34);
- match rate, which evaluates whether the booked entities match all the indicated constraints (83.90 vs. 69.09); and
- the general indicator of dialog success (86.5 vs. 61.2).
- outperforms humans in:
- informing the user about all the requested information (94.97 vs. 66.89);
- completing the task (86.5 vs. 75.0).
- outperforms state-of-the-art baselines in terms of:
What does the AI community think?
- The paper was accepted for oral presentation at EMNLP 2019, one of the leading conferences in natural language processing.
What are future research areas?
- Designing an agenda-based user simulator for dialog policy learning by leveraging the GDPL approach.
What are possible business applications?
- The introduced approach to reward estimation for task-oriented dialog can be directly leveraged in a number of real-world applications, such as:
- booking or recommending a restaurant/hotel;
- discovering information about the train/bus schedule;
- finding the closest hospital/police office.
Where can you get implementation code?
- The PyTorch implementation of the paper is available on GitHub.
3. Task-Oriented Conversation Generation Using Heterogeneous Memory Networks, by Zehao Lin, Xinjing Huang, Feng Ji, Haiqing Chen, Ying Zhang
Original Abstract
How to incorporate external knowledge into a neural dialogue model is critically important for dialogue systems to behave like real humans. To handle this problem, memory networks are usually a great choice and a promising way. However, existing memory networks do not perform well when leveraging heterogeneous information from different sources. In this paper, we propose a novel and versatile external memory networks called Heterogeneous Memory Networks (HMNs), to simultaneously utilize user utterances, dialogue history and background knowledge tuples. In our method, historical sequential dialogues are encoded and stored into the context-aware memory enhanced by gating mechanism while grounding knowledge tuples are encoded and stored into the context-free memory. During decoding, the decoder augmented with HMNs recurrently selects each word in one response utterance from these two memories and a general vocabulary. Experimental results on multiple real-world datasets show that HMNs significantly outperform the state-of-the-art data-driven task-oriented dialogue models in most domains.
Our Summary
The research team from Zhejiang University and Alibaba Group addresses the problem of incorporating heterogeneous information from different external knowledge sources into a task-oriented dialog model. To this end, they introduce a novel seq2seq neural conversation model augmented with Heterogeneous Memory Networks (HMNs), where sequential dialog history and background knowledge are modeled with two different kinds of memory networks, context-aware and context-free memories respectively. In the context-aware memory network, the gating mechanism is used to encode dialog history and user utterances. The extensive series of experiments demonstrates that context-aware memory networks can learn and store more meaningful representations than the alternative memory approaches, leading to the suggested model significantly outperforming the state-of-the-art task-oriented dialog models.
What’s the core idea of this paper?
- Existing memory networks typically treat information from multiple sources (e.g., sequential dialog history and structured knowledge bases) equally. Such methods result in two major challenges:
- modeling different types of structured information in only one memory network;
- modeling the effectiveness of knowledge from different sources in such a single memory network.
- To address these difficulties, the authors introduce a novel seq2seq neural conversation model augmented with Heterogeneous Memory Networks (HMNs). They suggest modeling dialog history and grounding external knowledge with two different networks:
- Context-aware memory encodes dialog history and user utterances using a gating mechanism.
- Context-free memory loads knowledge base information.
- The output of context-aware memory is fed to the context-free memory to search the representations of similar knowledge.
What’s the key achievement?
- The experiments on multiple datasets demonstrate the superiority of Heterogeneous Memory Networks in generating a fluent and accurate response in most tasks:
- on the Key-Value Retrieval dataset, HMNs outperform all state-of-the-art models in terms of BLEU and F1 scores, except for the F1 score for weather information retrieval;
- on the DSTC 2 dataset, the introduced model gets the best F1 score but falls behind seq2seq model with self-attention in terms of BLEU score;
- on bAbI tasks, HMNs achieve the best results on all tasks except for one (T5).
What does the AI community think?
- The paper was accepted for oral presentation at EMNLP 2019, one of the leading conferences in natural language processing.
What are future research areas?
- Adding other memory networks when more types of information need to be integrated.
What are possible business applications?
- Better incorporating the heterogeneous information from different external knowledge bases into task-oriented dialog systems and, as a result, enhancing the general performance of such systems.
4. Attention-Informed Mixed-Language Training for Zero-shot Cross-lingual Task-oriented Dialogue Systems, by Zihan Liu, Genta Indra Winata, Zhaojiang Lin, Peng Xu, Pascale Fung
Original Abstract
Recently, data-driven task-oriented dialogue systems have achieved promising performance in English. However, developing dialogue systems that support low-resource languages remains a long-standing challenge due to the absence of high-quality data. In order to circumvent the expensive and time-consuming data collection, we introduce Attention-Informed Mixed-Language Training (MLT), a novel zero-shot adaptation method for cross-lingual task-oriented dialogue systems. It leverages very few task-related parallel word pairs to generate code-switching sentences for learning the inter-lingual semantics across languages. Instead of manually selecting the word pairs, we propose to extract source words based on the scores computed by the attention layer of a trained English task-related model and then generate word pairs using existing bilingual dictionaries. Furthermore, intensive experiments with different cross-lingual embeddings demonstrate the effectiveness of our approach. Finally, with very few word pairs, our model achieves significant zero-shot adaptation performance improvements in both cross-lingual dialogue state tracking and natural language understanding (i.e., intent detection and slot filling) tasks compared to the current state-of-the-art approaches, which utilize a much larger amount of bilingual data.
Our Summary
The researchers from the Center for Artificial Intelligence Research (CAiRE) address the problem of developing task-oriented dialog systems for low-resource languages. To avoid expensive and time-consuming data collection, the research team introduces the Attention-Informed Mixed-Language Training (MLT), a novel framework for building zero-shot cross-lingual task-oriented dialog systems with an extremely small number of bilingual word pairs. The model is trained on mixed-language sentences that are built by first selecting English keywords using attention scores from a trained English model and then pairing these English words with target words using existing bilingual dictionaries. The evaluation demonstrates that the suggested approach achieves state-of-the-art zero-shot cross-lingual performance in both dialog state tracking and natural language understanding using far fewer bilingual resources.
What’s the core idea of this paper?
- The paper investigates zero-shot cross-lingual task-oriented dialog systems in the extremely low bilingual resources setting.
- The authors mention at least two major issues with the existing research on zero-shot cross-lingual learning in task-oriented dialog systems:
- To address these issues, the researchers introduce Attention-Informed Mixed-Language Training (MLT), a new zero-shot adaptation method for cross-lingual task-oriented dialog systems:
- To create word pairs, the system first chooses the keywords from the English training data using attention scores from a trained English model.
- Then, these English words are paired with target words using existing bilingual dictionaries.
- The target words replace English keywords and create mixed-language sentences in training data.
- Additionally, the researchers incorporate MUSE, RCSLS, XLM, and Multilingual BERT for generating cross-lingual embeddings.
What’s the key achievement?
- Outperforming state-of-the-art baselines in terms of dialog state tracking and natural language understanding while using only a few word pairs and far fewer bilingual resources than the alternative approaches.
- Getting remarkable gains of 17.69% and 21.45% in intent prediction and slot filling performance, respectively, with only five word pairs.
What does the AI community think?
- The paper was accepted for oral presentation at AAAI 2020, one of the leading conferences in artificial intelligence.
What are possible business applications?
- Global companies may leverage the suggested approach for building task-oriented dialog systems that can support numerous low-resource languages and communicate with customers all over the world.
Where can you get implementation code?
- The PyTorch implementation of this research paper is released on GitHub.
5. Modeling Long Context for Task-Oriented Dialogue State Generation, by Jun Quan and Deyi Xiong
Original Abstract
Based on the recently proposed transferable dialogue state generator (TRADE) that predicts dialogue states from utterance-concatenated dialogue context, we propose a multi-task learning model with a simple yet effective utterance tagging technique and a bidirectional language model as an auxiliary task for task-oriented dialogue state generation. By enabling the model to learn a better representation of the long dialogue context, our approaches attempt to solve the problem that the performance of the baseline significantly drops when the input dialogue context sequence is long. In our experiments, our proposed model achieves a 7.03% relative improvement over the baseline, establishing a new state-of-the-art joint goal accuracy of 52.04% on the MultiWOZ 2.0 dataset.
Our Summary
The research team from Soochow University investigates the problem of dialog state generation when the input dialog context sequence is long. The recently introduced TRADE approach concatenates all the system and user utterances into a single dialog context sequence for slot-value prediction. It achieves very good results on the MultiWOZ 2.0 dataset but as the dialog context gets longer, it becomes more difficult to distinguish between system and user utterances, and the performance of TRADE degrades. To overcome this problem, the authors of this paper suggest two methods: (1) tagging the utterances with [sys] and [usr] symbols; (2) integrating a bi-directional language modeling module into the upstream of the model as an auxiliary task. The evaluation on the MultiWOZ 2.0 dataset shows that both methods achieve significant improvements over the baselines, while the combination of the two methods establishes a new state of the art.
What’s the core idea of this paper?
- To predict slot-value pairs in open vocabularies, the recently introduced TRADE method proposes encoding the entire dialog context and predicting the value for each slot using a copy-augmented decoder.
- This approach achieves state-of-the-art results but cannot properly handle longer dialog context as it becomes more difficult for the model to identify whether an utterance in the dialogue context is from the system or the user.
- To address this issue, the authors of this paper propose two different methods:
- inserting a [sys] tag in front of each system utterance and a[usr] tag in front of each user utterance to explicitly enhance the model’s ability to distinguish between system and user utterances;
- integrating a bi-directional language modeling module as an auxiliary task to improve representation learning of the dialog context.
What’s the key achievement?
- The experiments on the MultiWOZ 2.0 dataset demonstrate that:
- The full model with both tagging and language modeling outperforms several strong baselines, including TRADE, and sets new state-of-the-art results with joint accuracy of 52.04% and slot accuracy of 97.26%.
- The tagging alone gets a 1.53% improvement of joint accuracy compared to TRADE.
- The language modeling alone gets a 2.74% improvement of joint accuracy compared to TRADE.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- The introduced approach can be incorporated into task-oriented dialog systems to improve modeling of long dialog context.
6. Multi-Domain Dialogue Acts and Response Co-Generation, by Kai Wang, Junfeng Tian, Rui Wang, Xiaojun Quan, Jianxing Yu
Original Abstract
Generating fluent and informative responses is of critical importance for task-oriented dialogue systems. Existing pipeline approaches generally predict multiple dialogue acts first and use them to assist response generation. There are at least two shortcomings with such approaches. First, the inherent structures of multi-domain dialogue acts are neglected. Second, the semantic associations between acts and responses are not taken into account for response generation. To address these issues, we propose a neural co-generation model that generates dialogue acts and responses concurrently. Unlike those pipeline approaches, our act generation module preserves the semantic structures of multi-domain dialogue acts and our response generation module dynamically attends to different acts as needed. We train the two modules jointly using an uncertainty loss to adjust their task weights adaptively. Extensive experiments are conducted on the large-scale MultiWOZ dataset and the results show that our model achieves very favorable improvement over several state-of-the-art models in both automatic and human evaluations.
Our Summary
The common practice for building task-oriented dialog systems is to decompose the task into several subtasks, including natural language understanding, dialog state tracking, dialog act prediction, and response generation. The researchers from Sun Yat-sen University and Alibaba Group claim that the pipeline approach – when the dialog act is predicted first and then used for response generation – has a number of limitations, like neglecting the inherent structures of multi-domain dialog acts and ignoring the semantic associations between acts and responses. To overcome these shortcomings, the authors suggest generating dialog acts and responses concurrently. They train the two modules jointly with an uncertainty loss to adaptively adjust the weight according to the task-specific uncertainty. The experiments on the MultiWOZ 2.0 dataset confirm the effectiveness of this approach.
What’s the core idea of this paper?
- In contrast to classification approaches that separately predict each act item (domain, action, and slot), the researchers suggest treating dialog act prediction as another sequence generation problem and generating dialog acts and responses concurrently (i.e., the MarCo approach):
- Dialog acts and response are co-generated based on the transformer encoder–decoder architecture.
- As for training, uncertainty loss is used for adaptive weighting.
- Such an approach has a number of advantages compared to classification approaches. Act sequence generation:
- preserves the interrelationships among dialog acts;
- allows close interactions with response generation.
What’s the key achievement?
- The experiments on the MultiWOZ 2.0 dataset demonstrate that the MarCo approach:
- outperforms all the baselines in terms of inform rate, request success, and combined score;
- shows inferior BLEU performance to the two HDSA models, but further analysis shows that it often generates responses that are inconsistent with references but preferred by human judges.
- The human evaluation demonstrates that the MarCo approach is superior to HDSA and comparable to human performance.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
What are future research areas?
- Addressing the problem of token level repetition.
What are possible business applications?
- The introduced dialog acts and response co-generation model can significantly improve the performance of task-oriented dialog systems by improving understanding of users’ needs, providing them with relevant information, and allowing users to finish their goals in fewer turns.
Where can you get implementation code?
- The PyTorch implementation of the paper is released on GitHub.
7. Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking, by Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, Monica S. Lam
Original Abstract
Zero-shot transfer learning for multi-domain dialogue state tracking can allow us to handle new domains without incurring the high cost of data acquisition. This paper proposes new zero-short transfer learning technique for dialogue state tracking where the in-domain training data are all synthesized from an abstract dialogue model and the ontology of the domain. We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning for both the TRADE model and the BERT-based SUMBT model on the MultiWOZ 2.1 dataset. We show training with only synthesized in-domain data on the SUMBT model can reach about 2/3 of the accuracy obtained with the full training dataset. We improve the zero-shot learning state of the art on average across domains by 21%.
Our Summary
The Stanford University research team suggests a new solution for dialog state tracking in new domains without spending a lot of resources on new data acquisition. Specifically, they introduce an algorithm that accepts an ontology of a domain and a few phrases commonly used in that domain to synthesize dialog training data based on an abstract dialog model. Then, this synthesized in-domain data and existing data for other domains are used for training and transferring knowledge to a new domain in a zero-shot setting. The experiments demonstrate that the introduced approach improves the accuracy of both the TRADE model and the SUMBT model, suggesting that the technique is independent of the specific model used. For the MultiWOZ 2.1 tasks, the introduced approach shows an average improvement of 21% over the previous state-of-the-art result on zero-shot transfer learning.
What’s the core idea of this paper?
- The paper introduces a new zero-shot transfer learning technique for dialog state tracking where the in-domain training data is all synthesized:
- The algorithm synthesizes dialog training data based on an abstract dialog model, an ontology of a domain, and a few phrases commonly used in that domain.
- Additionally, training samples from related domains are adapted to the new domain by substituting them with the vocabulary of the new domain.
- The accuracy of the abstract dialog model and the state-tracking neural network can be further improved by iteratively refining the model following the error analysis on the validation dataset and by adding annotations in the new domain.
What’s the key achievement?
- The experiments on the MultiWOZ 2.1 dataset demonstrate that:
- synthesized data improves zero-shot accuracy on all domains for both TRADE (up to 19%) and SUMBT (up to 30%) models;
- training with synthesized data is about half as good as full training for TRADE and about 2/3 as good as full training for SUMBT;
- The introduced zero-shot transfer learning technique using the BERT-based SUMBT model improves the zero-shot state of the art by 21% on average across domains, suggesting that pretraining complemented with the use of synthesized data is an effective technique for bootstrapping new dialog systems.
What does the AI community think?
- The paper was accepted to ACL 2020, the leading conference in natural language processing.
What are possible business applications?
- The proposed method for synthesizing dialogs in new domains can be a very cheap and yet effective technique for adapting dialog systems to new domains, especially when applied to pretraining-based dialog models.
Where can you get implementation code?
- The exact version of the tool used for the experiments, as well as the generated datasets, are available on GitHub.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.
Leave a Reply
You must be logged in to post a comment.