2020 is a breakthrough year for conversational agents. First, Google's chatbot Meena and Facebook's chatbot Blender demonstrated that dialog agents can achieve close to human-level performance on certain tasks. Then, OpenAI's GPT-3 model led many people to wonder whether Artificial General Intelligence (AGI) is already here. While we are still a long way from true AGI, conversations with GPT-3-based chatbots can be very entertaining.
To help you stay aware of the latest research breakthroughs in conversational AI, we have summarized the key ideas from these three research papers:
- Towards a Human-like Open-Domain Chatbot (aka Meena by Google)
- Recipes For Building an Open-Domain Chatbot (aka Blender by Facebook)
- Language Models are Few-Shot Learners (aka GPT-3 by OpenAI)
Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.
Key Research Advances in Dialog Agents 2020
Towards a Human-like Open-Domain Chatbot, by Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le
Original Abstract
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated.
Our Summary
In contrast to most modern conversational agents, which are highly specialized, the Google research team introduces Meena, a chatbot that can chat about virtually anything. It's built on a large neural network with 2.6B parameters trained on 341 GB of text. The researchers also propose a new human evaluation metric for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which captures important attributes of human conversation. They demonstrate that this metric correlates highly with perplexity, an automatic metric that is readily available. Thus, the Meena chatbot, which is trained to minimize perplexity, can conduct conversations that are more sensible and specific than those of other chatbots. In particular, the experiments demonstrate that Meena outperforms existing state-of-the-art chatbots by a large margin in terms of SSA score (79% vs. 56%) and is closing the gap with human performance (86%).
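Since Meena is trained end-to-end simply to minimize the perplexity of the next token, it may help to see what that metric actually is. Below is a minimal, hypothetical sketch of computing perplexity from a model's next-token logits; the tensor shapes and values are illustrative stand-ins, not anything from the Meena codebase.

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean next-token negative log-likelihood).

    logits: (seq_len, vocab_size) raw next-token scores from the model.
    targets: (seq_len,) ground-truth token ids.
    """
    nll = F.cross_entropy(logits, targets)  # mean negative log-likelihood
    return torch.exp(nll).item()

# Toy usage with random tensors standing in for real model outputs:
vocab_size, seq_len = 32000, 7
logits = torch.randn(seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (seq_len,))
print(f"perplexity: {perplexity(logits, targets):.1f}")
```

Lower perplexity means the model assigns higher probability to what humans actually said next, which is exactly the quantity the paper links to conversational quality.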
What’s the core idea of this paper?
- Despite recent progress, open-domain chatbots still have significant weaknesses: their responses often do not make sense or are too vague or generic.
- To address these issues, the Google research team introduces Meena, a generative conversational model with 2.6B parameters trained on 40B words mined from public social media conversations:
- Meena is built on a seq2seq model with Evolved Transformer (ET) that includes 1 ET encoder block and 13 ET decoder blocks.
- The model is trained on multi-turn conversations with the input sequence including all turns of the context (up to 7) and the output sequence being the response.
- To measure the quality of open-domain chatbots such as Meena, the researchers introduce a new human-evaluation metric, called Sensibleness and Specificity Average (SSA), that measures two fundamental aspects of a chatbot (a toy sketch of the SSA computation follows this list):
- making sense,
- being specific.
- The research team discovered that the SSA metric is strongly correlated with perplexity (R² = 0.93), a readily available automatic metric that Meena is trained to minimize: the lower the perplexity, the higher the SSA score.
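To make the SSA definition concrete, here is a toy sketch of the computation, assuming per-response binary human labels for sensibleness and specificity. The labels below are invented for illustration; in the paper's labeling scheme, a response judged not sensible is also marked not specific.

```python
def ssa(labels: list) -> float:
    """Average of the sensibleness rate and the specificity rate."""
    n = len(labels)
    sensibleness = sum(l["sensible"] for l in labels) / n
    specificity = sum(l["specific"] for l in labels) / n
    return (sensibleness + specificity) / 2

example = [
    {"sensible": True, "specific": True},    # sensible and specific
    {"sensible": True, "specific": False},   # sensible but generic ("I don't know")
    {"sensible": False, "specific": False},  # does not make sense in context
]
print(f"SSA: {ssa(example):.2f}")  # -> SSA: 0.50
```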
What’s the key achievement?
- Proposing a simple human-evaluation metric for open-domain chatbots.
- Demonstrating that a large-scale low-perplexity model can be a good conversationalist:
- The best end-to-end trained Meena model outperforms existing state-of-the-art open-domain chatbots by a large margin, achieving an SSA score of 72% (vs. 56%).
- Furthermore, the full version of Meena, with a filtering mechanism and tuned decoding, further advances the SSA score to 79%, which is not far from the 86% SSA achieved by the average human (a toy sketch of the tuned decoding follows this list).
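For readers curious about the "tuned decoding" mentioned above: the full version of Meena selects responses via sample-and-rank, sampling N independent candidate responses with temperature T and keeping the most likely one (the paper reports N = 20 and T = 0.88). Below is a self-contained toy sketch of that idea; the five-word vocabulary and the stand-in "model" are invented purely for illustration.

```python
import math
import random

VOCAB = ["hello", "world", "yes", "no", "<eos>"]

def toy_next_token_probs(prefix: list) -> list:
    """A deterministic stand-in for a real model's next-token distribution."""
    rng = random.Random(len(prefix))
    weights = [rng.random() + 1e-9 for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def sample_token(probs: list, temperature: float) -> int:
    """Sample a token index from the temperature-adjusted distribution."""
    exps = [p ** (1.0 / temperature) for p in probs]  # p^(1/T) = exp(log(p)/T)
    return random.choices(range(len(probs)), weights=exps)[0]

def sample_candidate(temperature: float, max_len: int = 10):
    tokens, logprob = [], 0.0
    while len(tokens) < max_len:
        probs = toy_next_token_probs(tokens)
        i = sample_token(probs, temperature)
        logprob += math.log(probs[i])  # score under the untempered model
        tokens.append(VOCAB[i])
        if VOCAB[i] == "<eos>":
            break
    return tokens, logprob

def sample_and_rank(n: int = 20, temperature: float = 0.88):
    """Draw n independent samples and return the most likely candidate."""
    return max((sample_candidate(temperature) for _ in range(n)),
               key=lambda c: c[1])

print(sample_and_rank())
```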
What does the AI community think?
- “Google’s “Meena” chatbot was trained on a full TPUv3 pod (2048 TPU cores) for 30 full days – that’s more than $1,400,000 of compute time to train this chatbot model.” – Elliot Turner, CEO and founder of Hyperia.
- “So I was browsing the results for the new Google chatbot Meena, and they look pretty OK (if boring sometimes). However, every once in a while it enters ‘scary sociopath mode,’ which is, shall we say, sub-optimal” – Graham Neubig, Associate professor at Carnegie Mellon University.
What are future research areas?
- Lowering the perplexity through improvements in algorithms, architectures, data, and compute.
- Considering other aspects of conversations beyond sensibleness and specificity, such as personality and factuality.
- Tackling safety and bias in the models.
What are possible business applications?
- The authors suggest some interesting applications for open-domain chatbots such as Meena:
- further humanizing computer interactions;
- improving foreign language practice;
- making interactive movie and videogame characters relatable.
Where can you get implementation code?
- Considering the challenges related to safety and bias in the models, the authors haven't released the Meena model yet. However, they are still evaluating its risks and benefits and may decide to release it in the coming months.
Recipes for Building an Open-Domain Chatbot, by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston
Original Abstract
Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.
Our Summary
The Facebook AI Research team shows that, with appropriate training data and generation strategy, large-scale models can learn many important conversational skills, such as engagingness, knowledge, empathy, and persona consistency. To build their state-of-the-art conversational agent, called BlenderBot, they trained models with up to 9.4B parameters on a novel task called Blended Skill Talk and used beam search with carefully selected hyperparameters as the generation strategy. Human evaluations demonstrate that BlenderBot outperforms Meena in pairwise comparisons, 75% to 25% in terms of engagingness and 65% to 35% in terms of humanness.
What’s the core idea of this paper?
- The introduced recipe for building a state-of-the-art open-domain chatbot includes three key ingredients:
- Large scale. The largest model has 9.4 billion parameters and was trained on 1.5 billion training examples of extracted conversations.
- Blended skills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaging use of knowledge, and display of empathy.
- Beam search used for decoding. The researchers show that this generation strategy, deployed with carefully selected hyperparameters, gives strong results. In particular, they demonstrate that the length of the agent's utterances is very important for chatbot performance (i.e., responses that are too short are often considered dull, while responses that are too long make the chatbot appear to waffle and not listen). A hedged decoding sketch follows this list.
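As a concrete illustration of this decoding recipe, here is a hedged sketch using the Hugging Face transformers port of BlenderBot. The checkpoint name and exact hyperparameter values are our assumptions for the sake of a runnable example, not the paper's released configuration, though the paper does find that constraining the minimum generation length (on the order of 20 BPE tokens) works well.

```python
from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

# Assumed publicly available distilled checkpoint; swap in another if needed.
name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("Hello, how are you today?", return_tensors="pt")
reply_ids = model.generate(
    **inputs,
    num_beams=10,   # beam search rather than greedy decoding or sampling
    min_length=20,  # forcing longer replies counters dull one-word answers
    max_length=60,
)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```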
What’s the key achievement?
- The introduced chatbot outperforms the previous best-performing open-domain chatbot, Meena. In pairwise match-ups, BlenderBot with 2.7B parameters wins:
- 75% of the time in terms of engagingness;
- 65% of the time in terms of humanness.
- In an A/B comparison between human-to-human and human-to-BlenderBot conversations, the latter were preferred 49% of the time as more engaging.
What are future research areas?
- Addressing limitations of the introduced conversational agent, including:
- a lack of in-depth knowledge if sufficiently interrogated;
- a tendency to use simpler language;
- a tendency to repeat oft-used phrases.
- Further exploring unlikelihood training and retrieve-and-refine mechanisms as potential avenues for fixing these issues.
Where can you get implementation code?
- Facebook AI open-sourced BlenderBot by releasing code to fine-tune the conversational agent, the model weights, and code to evaluate it.
Language Models are Few-Shot Learners, by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei
Original Abstract
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.
Our Summary
The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. They test their solution by training a 175B-parameter autoregressive language model, called GPT-3, and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.
What’s the core idea of this paper?
- The GPT-3 model uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization.
- However, in contrast to GPT-2, it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, as in the Sparse Transformer.
- The model is evaluated in three different settings (a toy prompt-construction sketch follows this list):
- Few-shot learning, when the model is given a few demonstrations of the task (typically 10 to 100) at inference time, but with no weight updates allowed.
- One-shot learning, when only one demonstration is allowed, together with a natural language description of the task.
- Zero-shot learning, when no demonstrations are allowed and the model has access only to a natural language description of the task.
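To make these settings concrete, here is a minimal sketch of how such prompts are assembled as plain text: a natural language task description, K demonstrations, and the query the model must complete, with no gradient updates involved. The helper function is hypothetical; the translation demonstrations mirror the style of an example from the paper.

```python
def build_prompt(description, demos, query):
    """Assemble an in-context learning prompt: description, K demos, query."""
    lines = [description]
    for source, target in demos:   # K = 0 (zero-shot), 1 (one-shot), ~10-100 (few-shot)
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")    # the model is asked to complete this line
    return "\n".join(lines)

prompt = build_prompt(
    "Translate English to French.",
    [("sea otter", "loutre de mer"), ("cheese", "fromage")],
    "peppermint",
)
print(prompt)
# Translate English to French.
# sea otter => loutre de mer
# cheese => fromage
# peppermint =>
```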
What’s the key achievement?
- The GPT-3 model without fine-tuning achieves promising results on a number of NLP tasks, and even occasionally surpasses state-of-the-art models that were fine-tuned for that specific task:
- On the CoQA benchmark, 81.5 F1 in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting, compared to the 90.7 F1 score achieved by fine-tuned SOTA.
- On the TriviaQA benchmark, 64.3% accuracy in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, surpassing the state of the art (68%) by 3.2%.
- On the LAMBADA dataset, 76.2% accuracy in the zero-shot setting, 72.5% in the one-shot setting, and 86.4% in the few-shot setting, surpassing the state of the art (68%) by 18%.
- The news articles generated by the 175B-parameter GPT-3 model are hard to distinguish from real ones, according to human evaluations (with accuracy barely above the chance level at ~52%).
What does the AI community think?
- “The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.” – Sam Altman, CEO and co-founder of OpenAI.
- “I’m shocked how hard it is to generate text about Muslims from GPT-3 that has nothing to do with violence… or being killed…” – Abubakar Abid, CEO and founder of Gradio.
- “No. GPT-3 fundamentally does not understand the world that it talks about. Increasing corpus further will allow it to generate a more credible pastiche but not fix its fundamental lack of comprehension of the world. Demos of GPT-4 will still require human cherry picking.” – Gary Marcus, CEO and founder of Robust.ai.
- “Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.” – Geoffrey Hinton, Turing Award winner.
What are future research areas?
- Improving pre-training sample efficiency.
- Exploring how few-shot learning works.
- Distillation of large models down to a manageable size for real-world applications.
What are possible business applications?
- The model with 175B parameters is hard to apply to real business problems due to its impractical resource requirements. However, if the researchers manage to distill this model down to a workable size, it could be applied to a wide range of language tasks, including question answering, dialog agents, and ad copy generation (a hedged distillation-loss sketch follows below).
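As background on what such a compression effort might involve, here is a hedged sketch of the standard knowledge-distillation loss (temperature-softened teacher targets, in the spirit of Hinton et al., 2015). Nothing here comes from an OpenAI release; all names and values are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    next-token distributions, scaled by T^2 so gradient magnitudes stay
    comparable across temperatures."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return loss * t * t

# Toy usage: random logits standing in for teacher/student outputs
# over a GPT-style vocabulary of 50257 tokens.
student = torch.randn(4, 50257)
teacher = torch.randn(4, 50257)
print(distillation_loss(student, teacher).item())
```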
Where can you get implementation code?
- The code itself is not available, but some dataset statistics together with unconditional, unfiltered 2048-token samples from GPT-3 are released on GitHub.
Challenges in Building Open-Domain Chatbots
Even though the recently introduced conversational agents demonstrate remarkable performance, there is still room for improvement. In particular, Huang, Zhu, and Gao (2019) discuss three challenges in developing outstanding open-domain dialog agents:
- Semantics. We expect a conversational agent to understand the content of the dialog and also to consider the user's emotional and social needs during the conversation.
- Consistency. For us to trust a virtual assistant, it should demonstrate a consistent personality.
- Interactiveness. A decent dialog system should be able to achieve complex social goals such as entertainment and conformity.
These are the issues that remain unsolved despite the significant progress of recently introduced dialog systems. However, many interesting research ideas addressing these challenges have been presented at the top AI and NLP academic conferences this year.
In this article, we have featured the most interesting research papers that introduce effective solutions for having meaningful, engaging, persona-consistent, and empathetic conversations with chatbots.
Enjoy this article? Sign up for more AI research updates.
We’ll let you know when we release more summary articles like this one.