Reinforcement learning (RL) remains less valuable for business applications than supervised learning, and even unsupervised learning. So far, it has been applied successfully only in areas where huge amounts of simulated data can be generated, such as robotics and games.
However, many experts recognize RL as a promising path towards Artificial General Intelligence (AGI), or true intelligence. Thus, research teams at top institutions and leading tech companies are seeking ways to make RL algorithms more sample-efficient and stable.
We’ve selected and summarized 10 research papers that we think are representative of the latest research trends in reinforcement learning. The papers explore, among other topics, the interaction of multiple agents, off-policy learning, and more efficient exploration.
If you’d like to skip around, here are the papers we featured:
- How to Combine Tree-Search Methods in Reinforcement Learning
- Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning
- Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
- Policy Certificates: Towards Accountable Reinforcement Learning
- Distributional Reinforcement Learning for Efficient Exploration
- Better Exploration with Optimistic Actor-Critic
- Guided Meta-Policy Search
- Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning
- Emergent Tool Use From Multi-Agent Autocurricula
- Solving Rubik’s Cube with a Robot Hand
10 Important Reinforcement Learning Research Papers of 2019
1. How to Combine Tree-Search Methods in Reinforcement Learning, by Yonathan Efroni, Gal Dalal, Bruno Scherrer, Shie Mannor
Original Abstract
Finite-horizon lookahead policies are abundantly used in Reinforcement Learning and demonstrate impressive empirical success. Usually, the lookahead policies are implemented with specific planning methods such as Monte Carlo Tree Search (e.g. in AlphaZero). Referring to the planning problem as tree search, a reasonable practice in these implementations is to back up the value only at the leaves while the information obtained at the root is not leveraged other than for updating the policy. Here, we question the potency of this approach. Namely, the latter procedure is non-contractive in general, and its convergence is not guaranteed. Our proposed enhancement is straightforward and simple: use the return from the optimal tree path to back up the values at the descendants of the root. This leads to a γ^h-contracting procedure, where γ is the discount factor and h is the tree depth. To establish our results, we first introduce a notion called multiple-step greedy consistency. We then provide convergence rates for two algorithmic instantiations of the above enhancement in the presence of noise injected to both the tree search stage and value estimation stage.
Our Summary
In this paper, the Technion research team explores ways to improve the implementation of lookahead policies. The usual practice is to back up the value only at the leaves, so the algorithm doesn’t leverage the information obtained at the root, except for updating the policy. The researchers show that this procedure is not guaranteed to converge, meaning that even state-of-the-art implementations like the Monte Carlo Tree Search used in AlphaZero do not necessarily converge to an optimal value. To solve this problem, the authors propose a simple enhancement: back up the values at the descendants of the root by using the return from the optimal tree path. The experiments in the paper show that the enhancement performs better than the “naive” tree search algorithm, which supports the paper’s theoretical analysis.
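To make the enhancement concrete, here is a minimal sketch of the idea (our own illustration, not the authors’ code): compute the discounted return along the best root-to-leaf path and use it as the backup value for the root’s children, rather than bootstrapping only on leaf-value estimates. The `Node` structure and the toy tree are hypothetical.

```python
# Hypothetical node structure for illustration: each node stores the reward
# collected when entering it and its children; leaves also carry a value estimate.
class Node:
    def __init__(self, reward=0.0, value=0.0, children=None):
        self.reward = reward
        self.value = value              # bootstrap value, used only at leaves
        self.children = children or []

def best_path_return(node, gamma):
    """Discounted return of the optimal path from `node` down to a leaf."""
    if not node.children:
        return node.reward + gamma * node.value
    return node.reward + gamma * max(best_path_return(c, gamma) for c in node.children)

def backup_root_descendants(root, gamma):
    """Proposed enhancement (sketch): back up the optimal-path return at the
    root's children, instead of bootstrapping only on the leaf values."""
    return [best_path_return(child, gamma) for child in root.children]

# Tiny usage example on a depth-2 tree.
root = Node(children=[
    Node(reward=1.0, children=[Node(reward=0.0, value=2.0),
                               Node(reward=0.5, value=0.0)]),
    Node(reward=0.0, children=[Node(reward=1.0, value=1.0)]),
])
print(backup_root_descendants(root, gamma=0.9))
```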
What’s the core idea of this paper?
- Tree search implementations usually only back up the value at the leaves. This procedure doesn’t leverage information obtained at the root and is not guaranteed to converge.
- The paper introduces a new procedure that uses the return from the optimal tree path to back up the values at the descendants of the root.
- This new approach enables bootstrapping of the optimal value obtained from the (h − 1)-horizon optimal planning problem instead of the “usual” value function.
What’s the key achievement?
- The experiments in the paper show that the enhancement performs significantly better than the traditional approach, especially when combined with short-horizon evaluation.
What does the AI community think?
- The paper received the AAAI 2019 Outstanding Paper Award.
What are future research areas?
- Further analysis of the non-contractive algorithms and understanding when they perform well.
- Performing a more in-depth analysis of the results from utilizing the planning byproducts.
What are possible business applications?
- Even though the contribution of this paper is primarily theoretical, the proposed approach can benefit a wide variety of applications, including state-of-the-art game-playing AI, route-finding, and scheduling.
2. Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning, by Natasha Jaques, Angeliki Lazaridou, Edward Hughes, Caglar Gulcehre, Pedro A. Ortega, DJ Strouse, Joel Z. Leibo, Nando de Freitas
Original Abstract
We propose a unified mechanism for achieving coordination and communication in Multi-Agent Reinforcement Learning (MARL), through rewarding agents for having causal influence over other agents’ actions. Causal influence is assessed using counterfactual reasoning. At each timestep, an agent simulates alternate actions that it could have taken, and computes their effect on the behavior of other agents. Actions that lead to bigger changes in other agents’ behavior are considered influential and are rewarded. We show that this is equivalent to rewarding agents for having high mutual information between their actions. Empirical results demonstrate that influence leads to enhanced coordination and communication in challenging social dilemma environments, dramatically increasing the learning curves of the deep RL agents, and leading to more meaningful learned communication protocols. The influence rewards for all agents can be computed in a decentralized way by enabling agents to learn a model of other agents using deep neural networks. In contrast, key previous works on emergent communication in the MARL setting were unable to learn diverse policies in a decentralized manner and had to resort to centralized training. Consequently, the influence reward opens up a window of new opportunities for research in this area.
Our Summary
In this paper, the authors consider the problem of deriving intrinsic social motivation from other agents in multi-agent reinforcement learning (MARL). Their approach is to reward agents for having a causal influence on other agents’ actions in order to achieve both coordination and communication in MARL. Specifically, they demonstrate that rewarding actions that lead to a larger change in another agent’s behavior is equivalent to maximizing the mutual information between agents’ actions. Such an inductive bias motivates agents to learn coordinated behavior. The experiments confirm the effectiveness of the proposed social influence reward in enhancing coordination and communication between the agents.
Figure caption: A moment of high influence, when the purple influencer signals the presence of an apple (green tiles) outside the yellow influencee’s field of view (yellow outlined box).
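As a rough illustration of the counterfactual influence reward described above, here is a minimal NumPy sketch (ours, with hypothetical function names): the influencer compares the other agent’s action distribution conditioned on the action it actually took against the counterfactual marginal obtained by averaging over the actions it could have taken, and is rewarded with the resulting divergence.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def influence_reward(cond_policy_other, actions_self, probs_self, taken_action):
    """Counterfactual social influence reward (illustrative sketch).

    cond_policy_other(a): distribution over the other agent's actions given that
        this agent takes action `a` (in the decentralized variant, this comes
        from a learned model of the other agent).
    actions_self, probs_self: this agent's action set and its policy probabilities.
    taken_action: the action this agent actually executed."""
    # Other agent's predicted behavior given the action actually taken.
    p_given_taken = np.asarray(cond_policy_other(taken_action), dtype=float)
    # Counterfactual marginal: average over the actions this agent could have taken.
    marginal = sum(pr * np.asarray(cond_policy_other(a), dtype=float)
                   for a, pr in zip(actions_self, probs_self))
    # Influence = how much the actual action shifted the other agent's behavior.
    return kl_divergence(p_given_taken, marginal)

# Toy usage: the influencer has 2 actions; the influencee has 3.
table = {0: [0.8, 0.1, 0.1], 1: [0.2, 0.4, 0.4]}
reward = influence_reward(lambda a: table[a], actions_self=[0, 1],
                          probs_self=[0.5, 0.5], taken_action=0)
print(round(reward, 3))
```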
What’s the core idea of this paper?
- The paper addresses the long-standing problem of coordination and communication between multiple agents, where previous approaches suffer from limitations such as centralized training and the sharing of reward functions or policy parameters.
- The authors suggest giving each agent an additional reward for having a causal influence on another agent’s actions.
- As the next step, they enhance the social influence reward with the inclusion of explicit communication protocols.
- Finally, they equip each agent with an internal neural network that is trained to predict the actions of other agents. That enables independent training of agents.
What’s the key achievement?
- Demonstrating that the social influence reward eventually leads to significantly higher collective reward and allows agents to learn meaningful communication protocols in settings where this is otherwise impossible.
- Introducing a framework for training the agents independently while still ensuring coordination and communication between them.
What does the AI community think?
- The paper received the Honorable Mention Award at ICML 2019, one of the leading conferences in machine learning.
What are future research areas?
- Using the proposed approach to develop a form of ‘empathy’ in agents so that they can simulate how their actions affect another agent’s value function.
- Applying the influence reward to encourage different modules of the network to integrate information from other networks, for example, to prevent collapse in hierarchical RL.
What are possible business applications?
- Driving coordinated behavior in robots attempting to cooperate in manipulation and control tasks.
3. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables, by Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, Sergey Levine
Original Abstract
Deep reinforcement learning algorithms require large amounts of experience to learn an individual task. While in principle meta-reinforcement learning (meta-RL) algorithms enable agents to learn new skills from small amounts of experience, several major challenges preclude their practicality. Current methods rely heavily on on-policy experience, limiting their sample efficiency. They also lack mechanisms to reason about task uncertainty when adapting to new tasks, limiting their effectiveness in sparse reward problems. In this paper, we address these challenges by developing an off-policy meta-RL algorithm that disentangles task inference and control. In our approach, we perform online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience. This probabilistic interpretation enables posterior sampling for structured and efficient exploration. We demonstrate how to integrate these task variables with off-policy RL algorithms to achieve both meta-training and adaptation efficiency. Our method outperforms prior algorithms in sample efficiency by 20-100x as well as in asymptotic performance on several meta-RL benchmarks.
Our Summary
The UC Berkeley research team addresses the problem of efficient off-policy meta-reinforcement learning (meta-RL). Specifically, they suggest integrating “online inference of probabilistic context variables with existing off-policy RL algorithms” to get sample efficiency during meta-training as well as fast adaptation. The introduced off-policy meta-RL algorithm, PEARL (Probabilistic Embeddings for Actor-critic RL), in effect samples task hypotheses, attempts the corresponding tasks, and then evaluates whether the hypotheses were correct. The experiments demonstrate that PEARL outperforms existing state-of-the-art approaches by 20-100× in meta-training sample efficiency and achieves a significant improvement in asymptotic performance.
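Here is a schematic NumPy sketch (names, shapes, and the placeholder encoder are ours, not PEARL’s actual networks) of the adaptation procedure described above: a probabilistic encoder maps collected context to a Gaussian posterior over a latent task variable z, z is sampled and held fixed for an episode to get temporally extended exploration, and the posterior narrows as more trajectories are gathered.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_context(context, latent_dim=5):
    """Stand-in for PEARL's learned probabilistic encoder: maps a batch of
    (s, a, r, s') transitions to a Gaussian posterior over the task variable z.
    Here we use simple reward statistics as a placeholder."""
    if len(context) == 0:                               # prior before seeing any data
        return np.zeros(latent_dim), np.ones(latent_dim)
    rewards = np.array([tr["r"] for tr in context])
    mu = np.full(latent_dim, rewards.mean())
    sigma = np.full(latent_dim, 1.0 / np.sqrt(len(context)))  # narrows with more data
    return mu, sigma

def policy(state, z):
    """Placeholder policy conditioned on the state and the sampled task variable z."""
    return float(np.tanh(state.mean() + z.mean()))

# Meta-test-time adaptation loop (sketch): sample z from the posterior, keep it
# fixed for the whole episode, add the episode to the context, update the posterior.
context = []
for episode in range(3):
    mu, sigma = encode_context(context)
    z = rng.normal(mu, sigma)                           # posterior sampling drives exploration
    for step in range(5):
        state = rng.normal(size=4)
        action = policy(state, z)
        reward = -abs(action - 0.3)                     # toy task signal
        context.append({"s": state, "a": action, "r": reward})
    print(f"episode {episode}: posterior std ~ {sigma.mean():.2f}")
```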
What’s the core idea of this paper?
- Meta-RL algorithms suffer from poor sample efficiency when using on-policy data, yet training meta-RL models on off-policy data introduces challenges such as a distribution mismatch between the data used at meta-training time and at meta-test time.
- To address these challenges, the researchers introduce PEARL: Probabilistic Embeddings for Actor-critic RL, which combines existing off-policy algorithms with the online inference of probabilistic context variables:
- At meta-training, a probabilistic encoder accumulates the necessary statistics from past experience into the context variables.
- At meta-test time, temporally-extended exploration is enabled by sampling context variables and holding them constant for the duration of the episode.
- Then, fast trajectory-level adaptation is achieved by using the collected trajectories to update the posterior over the context variables.
- In effect, the introduced approach allows optimizing the policy with off-policy data, while training the probabilistic encoder with on-policy data. Thus, the distribution mismatch between meta-train and meta-test is minimized.
What’s the key achievement?
- The experimental evaluation on six continuous control meta-learning environments demonstrates that PEARL outperforms the previous state-of-the-art approaches in terms of:
- sample efficiency by using 20-100× fewer samples during meta-training;
- asymptotic performance with the results improved by 50-100% in five out of six domains.
What does the AI community think?
- The paper was accepted for oral presentation at ICML 2019, one of the leading conferences in machine learning.
What are possible business applications?
- The introduced approach can significantly improve the efficiency of training autonomous agents.
Where can you get implementation code?
- An open-source implementation of PEARL is available on GitHub.
4. Policy Certificates: Towards Accountable Reinforcement Learning, by Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
Original Abstract
The performance of a reinforcement learning algorithm can vary drastically during learning because of exploration. Existing algorithms provide little information about the quality of their current policy before executing it, and thus have limited use in high-stakes applications like healthcare. We address this lack of accountability by proposing that algorithms output policy certificates. These certificates bound the sub-optimality and return of the policy in the next episode, allowing humans to intervene when the certified quality is not satisfactory. We further introduce two new algorithms with certificates and present a new framework for theoretical analysis that guarantees the quality of their policies and certificates. For tabular MDPs, we show that computing certificates can even improve the sample-efficiency of optimism-based exploration. As a result, one of our algorithms is the first to achieve minimax-optimal PAC bounds up to lower-order terms, and this algorithm also matches (and in some settings slightly improves upon) existing minimax regret bounds.
Our Summary
The unpredictable performance fluctuation of reinforcement learning (RL) algorithms limits their use in high-stakes applications like healthcare. To address this limitation, the authors of the paper suggest that algorithms reveal their performance during learning. Specifically, the researchers propose using policy certificates that output a confidence interval for the algorithm’s expected return in the next episode as well as a bound on how far from the optimal return the performance can go. With policy certificates in place, humans will be able to intervene when the performance of the RL algorithm drops below some minimum threshold. In addition, the authors introduce an IPOC framework that requires an algorithm to be an efficient learner and guarantees that the algorithm’s performance is within the limits shown in a policy certificate. Finally, the paper provides a new RL algorithm for finite episodic Markov decision processes (MDPs) that satisfies IPOC requirements and demonstrates stronger minimax regret and PAC guarantees than existing approaches.
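To illustrate how a policy certificate might be consumed in practice, here is a minimal sketch (our own; the structure and threshold logic are hypothetical, not the paper’s API): before each episode the algorithm emits a certified return interval and sub-optimality bound, and a supervisor intervenes when the certified guarantees fall below an application-specific threshold.

```python
from dataclasses import dataclass

@dataclass
class PolicyCertificate:
    """Certificate emitted before an episode (illustrative structure)."""
    return_lower: float      # certified lower bound on the expected return
    return_upper: float      # certified upper bound on the expected return
    suboptimality: float     # certified bound on the gap to the optimal return

def should_intervene(cert: PolicyCertificate,
                     min_return: float,
                     max_suboptimality: float) -> bool:
    """Intervene when the certified guarantees are not good enough."""
    return cert.return_lower < min_return or cert.suboptimality > max_suboptimality

# Toy usage: a high-stakes deployment with a minimum acceptable certified return.
cert = PolicyCertificate(return_lower=0.62, return_upper=0.88, suboptimality=0.15)
if should_intervene(cert, min_return=0.7, max_suboptimality=0.2):
    print("Fall back to a baseline policy for this episode.")
else:
    print("Execute the certified policy.")
```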
What’s the core idea of this paper?
- The joint research team from Carnegie Mellon University, Google Research, and Stanford explores ways to make reinforcement learning algorithms more accountable.
- To this end, they suggest that RL algorithms output policy certificates in episodic RL. These certificates should include:
- a confidence interval for the algorithm’s expected sum of rewards in the next episode;
- a bound on how far from the optimal return the performance can be.
- In addition to accountability, the researchers also want RL algorithms to be sample-efficient. Thus, they introduce a new framework for theoretical analysis, called IPOC, that:
- ensures that policy certificates indeed bound the expected performance of an algorithm in an episode;
- prescribes the rate at which the policy and certificates of an algorithm improve with more data.
What’s the key achievement?
- Introducing policy certificates and the IPOC framework, which together lead to increased accountability and sample-efficiency of RL algorithms.
- Proposing a new algorithm for tabular MDPs that outperforms the previous state-of-the-art approaches in terms of regret and PAC guarantees.
What does the AI community think?
- The paper was accepted for oral presentation at ICML 2019, one of the leading conferences in machine learning.
What are future research areas?
- Scaling up the proposed ideas to continuous state spaces and extending them to model-free RL.
- Providing per-episode risk-sensitive guarantees on the reward obtained.
What are possible business applications?
- Enabling the use of reinforcement learning in high-stakes applications like healthcare and financial trading.
5. Distributional Reinforcement Learning for Efficient Exploration, by Borislav Mavrin, Shangtong Zhang, Hengshuai Yao, Linglong Kong, Kaiwen Wu, Yaoliang Yu
Original Abstract
In distributional reinforcement learning (RL), the estimated distribution of value function models both the parametric and intrinsic uncertainties. We propose a novel and efficient exploration method for deep RL that has two components. The first is a decaying schedule to suppress the intrinsic uncertainty. The second is an exploration bonus calculated from the upper quantiles of the learned distribution. In Atari 2600 games, our method outperforms QR-DQN in 12 out of 14 hard games (achieving 483% average gain across 49 games in cumulative rewards over QR-DQN with a big win in Venture). We also compared our algorithm with QR-DQN in a challenging 3D driving simulator (CARLA). Results show that our algorithm achieves near-optimal safety rewards twice faster than QRDQN.
Our Summary
In this paper, the researchers investigate how the distributions learned by distributional RL methods can improve the efficiency of exploration. They start with the Quantile Regression Deep Q-Network (QR-DQN) to learn the distribution of the value function. Then, a decaying schedule is used to suppress the intrinsic uncertainty. Finally, they estimate an optimistic exploration bonus for QR-DQN from the upper quantiles of the learned distribution. The experiments demonstrate that the introduced algorithm significantly outperforms QR-DQN on Atari 2600 games and in a 3D driving simulator.
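Here is a simplified NumPy sketch of the exploration rule (our own; the paper derives its bonus from a truncated variance of the quantiles with a particular decaying schedule, so treat the formula below as a stand-in): compute an optimistic bonus from the spread of the upper quantiles of each action’s learned return distribution, scale it by a decaying schedule, and act greedily with respect to mean return plus bonus.

```python
import numpy as np

def exploration_bonus(quantiles, t, c=50.0):
    """Optimistic bonus from the upper quantiles of a learned return distribution
    (simplified stand-in for the paper's truncated-variance bonus).

    quantiles: array of shape (num_actions, num_quantiles), as produced by a
    QR-DQN-style head; t: current timestep (1-based)."""
    n = quantiles.shape[1]
    upper = quantiles[:, n // 2:]             # quantiles above the median
    spread = upper.var(axis=1)                # variability of the optimistic tail
    schedule = c * np.sqrt(np.log(t) / t)     # decaying schedule suppresses the bonus over time
    return schedule * np.sqrt(spread)

def select_action(quantiles, t):
    """Act greedily with respect to mean return plus the decaying optimistic bonus."""
    means = quantiles.mean(axis=1)
    return int(np.argmax(means + exploration_bonus(quantiles, t)))

# Toy usage: 3 actions, 8 quantile estimates each.
rng = np.random.default_rng(1)
q = np.sort(rng.normal(size=(3, 8)), axis=1)
print(select_action(q, t=10), select_action(q, t=10_000))
```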
What’s the core idea of this paper?
- The authors propose a novel and efficient exploration approach that is based on using distributions learned via distributional RL methods:
- The authors show that these distributions model the randomness arising from intrinsic and parametric uncertainties.
- The composite effect of intrinsic and parametric uncertainties may be detrimental to efficient exploration.
- Thus, they suggest using the Quantile Regression Deep Q-Network (QR-DQN) for learning the distribution of the value function and then add two components:
- a decaying schedule to suppress the intrinsic uncertainty;
- an exploration bonus derived from the upper quantiles of the estimated distribution.
What’s the key achievement?
- The suggested algorithm outperforms QR-DQN (with an ε-greedy strategy):
- in 12 out of 14 hard Atari 2600 games, with a 483% average gain in cumulative rewards across 49 games;
- in the 3D driving simulator CARLA, by achieving near-optimal safety rewards twice as fast.
What does the AI community think?
- The paper was accepted for oral presentation at ICML 2019, one of the leading conferences in machine learning.
What are future research areas?
- Combining the introduced method with other recent advancements in deep RL (e.g., Rainbow) to get even better results.
6. Better Exploration with Optimistic Actor-Critic, by Kamil Ciosek, Quan Vuong, Robert Loftin, Katja Hofmann
Original Abstract
Actor-critic methods, a type of model-free Reinforcement Learning, have been successfully applied to challenging tasks in continuous control, often achieving state-of-the-art performance. However, wide-scale adoption of these methods in real-world domains is made difficult by their poor sample efficiency. We address this problem both theoretically and empirically. On the theoretical side, we identify two phenomena preventing efficient exploration in existing state-of-the-art algorithms such as Soft Actor Critic. First, combining a greedy actor update with a pessimistic estimate of the critic leads to the avoidance of actions that the agent does not know about, a phenomenon we call pessimistic underexploration. Second, current algorithms are directionally uninformed, sampling actions with equal probability in opposite directions from the current mean. This is wasteful, since we typically need actions taken along certain directions much more than others. To address both of these phenomena, we introduce a new algorithm, Optimistic Actor-Critic, which approximates a lower and upper confidence bound on the state-action value function. This allows us to apply the principle of optimism in the face of uncertainty to perform directed exploration using the upper bound while still using the lower bound to avoid overestimation. We evaluate OAC in several challenging continuous control tasks, achieving state-of-the-art sample efficiency.
Our Summary
Actor-critic methods usually suffer from poor sample efficiency. The Microsoft Research team aims to mitigate this problem through more efficient exploration. Specifically, they point out that state-of-the-art algorithms such as Soft Actor-Critic and TD3 adjust the exploration policy using a lower bound on the Q-function, which improves stability but can inhibit exploration if the lower bound is far from the true Q-function. In addition, Gaussian policies are directionally uninformed: they sample actions with equal probability in opposite directions from the current mean. This paper introduces Optimistic Actor-Critic (OAC), which uses an off-policy exploration strategy based on an upper confidence bound on the Q-function. The algorithm is directionally informed because the exploration policy is not required to have the same mean as the target policy. The experiments demonstrate that the introduced algorithm achieves state-of-the-art sample efficiency on the Humanoid benchmark.
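The two bounds and the shifted exploration mean can be sketched as follows (our own NumPy illustration under the usual twin-critic setup; the values of β_UB and δ are hypothetical): the lower bound is the pessimistic minimum of the two critics and is used for the target-policy update, while the upper bound adds a multiple of the critics’ disagreement and directs exploration via a KL-limited shift of the policy mean.

```python
import numpy as np

def q_bounds(q1, q2, beta_ub=3.0):
    """Lower and upper confidence bounds on Q built from two bootstrapped critics
    (sketch; beta_ub is a hypothetical uncertainty multiplier)."""
    q1, q2 = np.asarray(q1, dtype=float), np.asarray(q2, dtype=float)
    mean = (q1 + q2) / 2.0
    std = np.abs(q1 - q2) / 2.0          # epistemic spread between the critics
    q_lb = np.minimum(q1, q2)            # pessimistic bound: target-policy update
    q_ub = mean + beta_ub * std          # optimistic bound: directs exploration
    return q_lb, q_ub

def shifted_exploration_mean(mu_target, grad_q_ub, sigma, delta=0.1):
    """Shift the exploration policy's mean along the gradient of the upper bound,
    with the step size limited by a KL-style constraint (sketch, diagonal covariance)."""
    grad = np.asarray(grad_q_ub, dtype=float)
    norm = np.sqrt(grad @ (sigma ** 2 * grad)) + 1e-8
    return mu_target + np.sqrt(2.0 * delta) * (sigma ** 2 * grad) / norm

# Toy usage on a single state-action pair and a 2-D action space.
print(q_bounds(q1=[1.0], q2=[1.4]))
print(shifted_exploration_mean(mu_target=np.zeros(2),
                               grad_q_ub=[0.5, -0.2],
                               sigma=np.array([0.3, 0.3])))
```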
What’s the core idea of this paper?
- The paper addresses the problem of the poor sample efficiency of actor-critic methods.
- The authors introduce Optimistic Actor-Critic (OAC), an algorithm with more efficient exploration that is achieved by “applying the principle of optimism in the face of uncertainty”:
- OAC uses an off-policy exploration strategy.
- The strategy maximizes an upper confidence bound on the critic, which is obtained from a bootstrapped estimate of the Q-function.
- OAC is directionally informed and doesn’t sample parts of the action space that have already been explored.
- The instability of off-policy RL algorithms is addressed with a KL constraint between the exploration policy and the target policy.
- Finally, to avoid overestimation, the introduced algorithm updates its target policy using a lower confidence bound for the critic.
What’s the key achievement?
- The empirical evaluation demonstrates that OAC achieves state-of-the-art sample efficiency on the Humanoid benchmark and outperforms Soft Actor-Critic, previously the most sample-efficient model-free RL algorithm for continuous control tasks.
What does the AI community think?
- The paper was accepted for the Spotlight presentation at NeurIPS 2019, the leading conference in artificial intelligence.
What are possible business applications?
- Optimistic Actor-Critic can be applied to challenging tasks in continuous control to improve sample efficiency.
7. Guided Meta-Policy Search, by Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn
Original Abstract
Reinforcement learning (RL) algorithms have demonstrated promising results on complex tasks, yet often require impractical numbers of samples because they learn from scratch. Meta-RL aims to address this challenge by leveraging experience from previous tasks in order to more quickly solve new tasks. However, in practice, these algorithms generally also require large amounts of on-policy experience during the meta-training process, making them impractical for use in many problems. To this end, we propose to learn a reinforcement learning procedure through imitation of expert policies that solve previously-seen tasks. This involves a nested optimization, with RL in the inner loop and supervised imitation learning in the outer loop. Because the outer loop imitation learning can be done with off-policy data, we can achieve significant gains in meta-learning sample efficiency. In this paper, we show how this general idea can be used both for meta-reinforcement learning and for learning fast RL procedures from multi-task demonstration data. The former results in an approach that can leverage policies learned for previous tasks without significant amounts of on-policy data during meta-training, whereas the latter is particularly useful in cases where demonstrations are easy for a person to provide. Across a number of continuous control meta-RL problems, we demonstrate significant improvements in meta-RL sample efficiency in comparison to prior work as well as the ability to scale to domains with visual observations.
Our Summary
The UC Berkeley research team introduces a new approach to improving the sample efficiency of meta-reinforcement learning (meta-RL). They suggest using a stable and efficient supervised imitation learning procedure for the meta-optimization in the outer loop, while keeping the benefits of reinforcement learning in the inner loop of the optimization. The expert policies for imitation can come from human-provided demonstrations or be acquired with off-policy RL algorithms. The experiments confirm the efficiency of the suggested approach and its applicability to real-world settings.
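The nested optimization can be sketched schematically as follows (our own toy NumPy illustration; the gradient estimators and tasks are placeholders, and a real implementation would backpropagate through the inner adaptation step): the inner loop adapts the meta-parameters to each task with an RL-style update, and the outer loop updates them with a supervised imitation loss against that task’s expert, which can be computed from off-policy expert data.

```python
import numpy as np

def inner_rl_adaptation(theta, task, alpha=0.1):
    """Inner loop (sketch): adapt the meta-parameters to one task with a single
    policy-gradient-style step; `estimate_policy_gradient` is a placeholder."""
    return theta + alpha * estimate_policy_gradient(theta, task)

def outer_imitation_step(theta, tasks, experts, beta=0.1):
    """Outer loop (sketch): update the meta-parameters so that each task's
    *adapted* policy imitates that task's expert. This supervised step can use
    off-policy expert data; for brevity we ignore the chain rule through the
    inner adaptation step, which a real implementation would backpropagate."""
    total_grad = np.zeros_like(theta)
    for task, expert in zip(tasks, experts):
        adapted = inner_rl_adaptation(theta, task)
        total_grad += imitation_loss_gradient(adapted, expert)
    return theta - beta * total_grad / len(tasks)

# Placeholder components so the sketch runs end to end on a toy problem.
def estimate_policy_gradient(theta, task):
    return task["goal"] - theta                  # pretend gradient toward the task goal

def imitation_loss_gradient(adapted, expert):
    return adapted - expert["actions"]           # gradient of a squared imitation loss

theta = np.zeros(3)
tasks = [{"goal": np.array([1.0, 0.0, 0.0])}, {"goal": np.array([0.0, 1.0, 0.0])}]
experts = [{"actions": np.array([0.9, 0.0, 0.0])}, {"actions": np.array([0.0, 0.9, 0.0])}]
for _ in range(100):
    theta = outer_imitation_step(theta, tasks, experts)
print(np.round(theta, 2))                        # meta-parameters shaped by imitation
```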
What’s the core idea of this paper?
- Meta-reinforcement learning is a promising approach to building flexible agents that will be able to use previous experience to manipulate new objects in new ways. However, meta-RL algorithms usually require lots of on-policy data during the meta-training process, which makes them impractical for real-world applications.
- To address this challenge, the researchers suggest using a much more stable and efficient algorithm for supervision at the meta-level, namely supervised imitation learning:
- In the outer loop, expert actions are leveraged for more direct supervision. These expert policies can be:
- produced automatically by standard RL methods;
- acquired using efficient off-policy RL algorithms;
- collected from human-provided demonstrations.
- In the inner loop, reinforcement learning is used to quickly learn new tasks.
What’s the key achievement?
- The experiments demonstrate that the introduced approach to meta-training of RL algorithms:
- requires up to 10× fewer interaction episodes than standard meta-RL to learn comparable adaptation skills;
- can be applied to tasks with sparse rewards.
What does the AI community think?
- The paper was accepted for the Spotlight presentation at NeurIPS 2019, the leading conference in artificial intelligence.
What are future research areas?
- Investigating applications of the presented approach in real-world settings.
What are possible business applications?
- The introduced meta-RL algorithm with supervised imitation can be applied:
- in domains with visual observations;
- in physical robotic systems.
Where can you get implementation code?
- The implementation code for this paper is available on GitHub.
8. Using a Logarithmic Mapping to Enable Lower Discount Factors in Reinforcement Learning, by Harm van Seijen, Mehdi Fatemi, Arash Tavakoli
Original Abstract
In an effort to better understand the different ways in which the discount factor affects the optimization process in reinforcement learning, we designed a set of experiments to study each effect in isolation. Our analysis reveals that the common perception that poor performance of low discount factors is caused by (too) small action-gaps requires revision. We propose an alternative hypothesis, which identifies the size-difference of the action-gap across the state-space as the primary cause. We then introduce a new method that enables more homogeneous action-gaps by mapping value estimates to a logarithmic space. We prove convergence for this method under standard assumptions and demonstrate empirically that it indeed enables lower discount factors for approximate reinforcement-learning methods. This in turn allows tackling a class of reinforcement-learning problems that are challenging to solve with traditional methods.
Our Summary
The discount factor plays the role of a hyperparameter in reinforcement learning: it helps avoid some of the optimization challenges that arise when optimizing an undiscounted objective directly. It was commonly believed that low discount factors perform poorly because of too-small action gaps (i.e., the difference between the values of the best and the second-best actions). In this paper, the authors show that this perception needs revision: the primary cause is in fact the size difference of the action gap across the state space. The researchers introduce a new method that ensures more homogeneous action-gap sizes by mapping value estimates to a logarithmic space, which is especially useful for sparse-reward problems. The experiments demonstrate that this method achieves much better performance for low discount factors than previously possible, supporting the theoretical analysis.
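As a toy illustration of updating values in a logarithmically mapped space, here is a minimal tabular sketch (ours; it assumes non-negative returns and a simple mapping f(q) = log(q + d), whereas the paper’s construction is more careful and also handles negative values): the Q-learning target is formed in regular space, mapped, and the stored log-space value is moved toward it.

```python
import numpy as np

# Illustrative mapping to logarithmic space and its inverse. This sketch assumes
# non-negative returns; the paper's construction is more general.
C, D = 1.0, 0.01

def f(q):
    return C * np.log(q + D)

def f_inv(y):
    return np.exp(y / C) - D

def log_space_update(q_log, s, a, r, s_next, gamma=0.96, lr=0.1):
    """One tabular Q-learning update performed on logarithmically mapped values:
    build the target in regular space, map it, and move the stored log-space
    value toward it."""
    target = r + gamma * f_inv(q_log[s_next].max())     # target in regular space
    q_log[s, a] += lr * (f(max(target, 0.0)) - q_log[s, a])
    return q_log

# Toy usage: a 3-state chain with a sparse reward for reaching state 2.
q_log = np.full((3, 2), f(0.0))                         # initialize at f(0)
for _ in range(200):
    q_log = log_space_update(q_log, s=0, a=1, r=0.0, s_next=1)
    q_log = log_space_update(q_log, s=1, a=1, r=1.0, s_next=2)
print(np.round(f_inv(q_log), 3))                        # values mapped back to regular space
```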
What’s the core idea of this paper?
- The common perception that low discount factors perform poorly because of (too) small action gaps needs revision.
- A large size difference of the action gap across the state-space might be the primary factor causing poor performance of approximate RL.
- The paper introduces a new method that ensures more homogeneous action-gap sizes, and thus improves the performance of the RL algorithm for low discount factors:
- The update target is mapped to a logarithmic space, and the updates are performed in this space.
What’s the key achievement?
- The experiments demonstrate that a new method with more homogeneous action-gap sizes, called LogDQN, performs well even for low discount factors.
- The paper’s analytical and empirical results suggest that there are tasks where low discount factors perform asymptotically better than higher ones. This implies that the introduced method can open up a class of previously unachievable tasks to RL.
What does the AI community think?
- The paper was accepted for oral presentation at NeurIPS 2019, the leading conference in artificial intelligence.
What are future research areas?
- Re-evaluating the other hyperparameters in the low discount factor region.
Where can you get implementation code?
- The implementation code for the linear experiments of the paper as well as the deep RL Atari 2600 examples is provided on GitHub.
9. Emergent Tool Use From Multi-Agent Autocurricula, by Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, Igor Mordatch
Original Abstract
Through multi-agent competition, the simple objective of hide-and-seek, and standard reinforcement learning algorithms at scale, we find that agents create a self-supervised autocurriculum inducing multiple distinct rounds of emergent strategy, many of which require sophisticated tool use and coordination. We find clear evidence of six emergent phases in agent strategy in our environment, each of which creates a new pressure for the opposing team to adapt; for instance, agents learn to build multi-object shelters using moveable boxes which in turn leads to agents discovering that they can overcome obstacles using ramps. We further provide evidence that multi-agent competition may scale better with increasing environment complexity and leads to behavior that centers around far more human-relevant skills than other self-supervised reinforcement learning methods such as intrinsic motivation. Finally, we propose transfer and fine-tuning as a way to quantitatively evaluate targeted capabilities, and we compare hide-and-seek agents to both intrinsic motivation and random initialization baselines in a suite of domain-specific intelligence tests.
Our Summary
Replicating the ability to solve complex, human-relevant tasks in artificially intelligent agents is an ongoing challenge. In this paper, the OpenAI research team demonstrates that an implicit autocurriculum, whereby competing agents continually create new tasks for each other, leads to ever more sophisticated agent strategies. The researchers introduce a new competitive and cooperative environment in which agents play a simple game of hide-and-seek. With only a visibility-based reward function and competition, the agents were able to develop complex strategies, including collaborative tool use, barricading doors, constructing multi-object forts, and using ramps to jump into hiders’ shelters. In addition, the authors introduce a framework for evaluating agents in open-ended environments.
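The visibility-based team reward can be written down in a few lines (a sketch of the scheme as described here, not the authors’ code): hiders share a positive reward when every hider is hidden and a penalty when any hider is seen, and seekers receive the opposite.

```python
def hide_and_seek_rewards(visible_hiders, num_hiders, num_seekers):
    """Team-based, visibility-driven reward (sketch of the scheme described above).

    visible_hiders: set of hider indices currently seen by any seeker."""
    hider_reward = 1.0 if len(visible_hiders) == 0 else -1.0   # all hidden vs. any seen
    seeker_reward = -hider_reward                              # seekers get the opposite
    return [hider_reward] * num_hiders, [seeker_reward] * num_seekers

# Usage: one hider has been spotted, so hiders are penalized and seekers rewarded.
print(hide_and_seek_rewards(visible_hiders={1}, num_hiders=2, num_seekers=2))
```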
What’s the core idea of this paper?
- In this paper, the OpenAI team wants to demonstrate that, like humans and animals, AI agents can evolve best through competition with each other.
- To test this claim, they have developed a competitive and cooperative physics-based environment, where agents play a hide-and-seek game:
- Agents act independently, using their own observations and hidden memories, and are given no explicit incentive to interact with objects in the environment.
- They are given a team-based reward. Specifically, hiders receive a reward if all hiders are hidden and a penalty if any hider is seen by a seeker. Seekers are given the opposite rewards.
- The authors also show that agents with competitive, extrinsic motivation are far more likely to use tools and build shelters than intrinsically motivated agents.
What’s the key achievement?
- Providing evidence that multi-agent autocurricula lead to the development of complex strategies and tool use.
- Introducing a framework for evaluating agents in open-ended environments, including five intelligence tests to quantitatively measure agents’ capabilities.
- Open-sourcing a new physics-based environment to encourage future research in this area.
What are future research areas?
- Reducing sample complexity.
- Further improving policy learning algorithms and architectures.
- Investigating methods to create environments that better prevent unwanted agent behaviors.
What are possible business applications?
- Optimization modeling for human conflicts.
- Modeling complex competitive behaviors between pathogens and host immune systems.
Where can you get implementation code?
- The code for environment construction is available on GitHub.
10. Solving Rubik’s Cube with a Robot Hand, by Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, Lei Zhang
Original Abstract
We demonstrate that models trained only in simulation can be used to solve a manipulation problem of unprecedented complexity on a real robot. This is made possible by two key components: a novel algorithm, which we call automatic domain randomization (ADR) and a robot platform built for machine learning. ADR automatically generates a distribution over randomized environments of ever-increasing difficulty. Control policies and vision state estimators trained with ADR exhibit vastly improved sim2real transfer. For control policies, memory-augmented models trained on an ADR-generated distribution of environments show clear signs of emergent meta-learning at test time. The combination of ADR with our custom robot platform allows us to solve a Rubik’s cube with a humanoid robot hand, which involves both control and state estimation problems. Videos summarizing our results are available: https://openai.com/blog/solving-rubiks-cube/
Our Summary
Training robots on physical systems is usually too expensive and time-consuming, while simulations typically cannot capture the real environment in enough detail. However, in this paper, the OpenAI research team demonstrates that a robot trained only on simulated data can solve a real-world manipulation problem – in this case, solving a Rubik’s cube. To create simulated environments diverse enough to capture the physics of the real world, they developed a new algorithm, Automatic Domain Randomization (ADR), which generates progressively more difficult environments in simulation. A robot hand trained with this method is able to solve a real-world Rubik’s cube 20%–60% of the time, depending on the difficulty of the initial scramble.
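The ADR idea can be condensed into a short sketch (our own simplification; the paper evaluates and moves each randomization boundary separately based on performance measured at that boundary, and the thresholds below are hypothetical): every environment parameter has a randomization range that widens when the policy performs well at its boundaries, so the distribution of training environments keeps getting harder.

```python
import random

class ADRParameter:
    """One environment-randomization range managed by ADR (simplified sketch)."""

    def __init__(self, low, high, step=0.05):
        self.low, self.high, self.step = low, high, step

    def sample(self):
        """Sample a value for this parameter when building a training environment."""
        return random.uniform(self.low, self.high)

    def update(self, boundary_performance, expand_at=0.8, shrink_at=0.4):
        """Widen the range when the policy performs well at its boundaries and
        narrow it when the policy struggles. (The paper moves each boundary
        separately; this sketch treats the range symmetrically.)"""
        mid = (self.low + self.high) / 2
        if boundary_performance >= expand_at:
            self.low -= self.step
            self.high += self.step
        elif boundary_performance <= shrink_at:
            self.low = min(self.low + self.step, mid)
            self.high = max(self.high - self.step, mid)

# Usage: the randomization range for, e.g., cube size grows as the policy improves.
cube_size = ADRParameter(low=0.95, high=1.05)
for performance in [0.5, 0.85, 0.9, 0.95]:
    cube_size.update(performance)
print(round(cube_size.low, 2), round(cube_size.high, 2))
```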
What’s the core idea of this paper?
- The OpenAI team suggests the following framework for training a humanoid hand to solve a Rubik’s cube:
- Using Automatic Domain Randomization (ADR) to collect simulated data on a growing distribution of randomized environments.
- Training a control policy, using a recurrent neural network (an LSTM) and reinforcement learning, that chooses the hand’s next action based on the fingertip positions and the cube state.
- Training a convolutional neural network (CNN), separately from the control policy, to estimate the cube state from three simulated camera images.
- Transferring the task to the real world by combining the state estimation from the CNN with the control policy.
- Furthermore, the researchers find signs of emergent meta-learning at test time. They attribute this implicit meta-learning to training an LSTM on an ever-growing ADR distribution.
What’s the key achievement?
- The robot hand is able to successfully manipulate the Rubik’s cube to a solved state:
- 20% of the time for maximally difficult scrambles requiring 26 face rotations;
- 60% of the time for simpler scrambles requiring 15 face rotations.
- The robot hand is robust enough to deal with perturbations during the manipulation, such as tying fingers together or putting a blanket over the cube.
What does the AI community think?
- “What is exciting about this work is that the system learns. It doesn’t memorize one way to solve the problem. It learns,” said Jeff Clune, a robotics professor at the University of Wyoming.
- “The work itself is impressive, but mischaracterized, and … a better title would have been ‘manipulating a Rubik’s cube using reinforcement learning’ or ‘progress in manipulation with dextrous robotic hands’” – Gary Marcus, CEO and Founder of Robust.ai, details his opinion on the achievements of this paper.
- “This is an interesting and positive step forward, but it is really important not to exaggerate it,” said Ken Goldberg, a professor at the University of California, Berkeley.
What are future research areas?
- Extending to general-purpose systems that can quickly adapt to the changing environment.
- Improving adaptation to real-world dynamics during the first few moves of the manipulation task.
What are possible business applications?
- The suggested approach to training an algorithm in a simulated environment might be used to train robotic hands for further applications in manufacturing and warehouse operations.
If you like these research summaries, you might also be interested in the following articles:
- Top AI & Machine Learning Research Papers From 2019
- What Are Major NLP Achievements & Papers From 2019?
- 10 Important Research Papers In Conversational AI From 2019
- 10 Cutting-Edge Research Papers In Computer Vision From 2019
- Top 12 AI Ethics Research Papers Introduced In 2019