Entry Date:
December 12, 2013

Reward Learning from Demonstration


Many situations in artificial intelligence (and everyday life) involve learning a task from observed demonstrations. In robotics and autonomy, there is a large body of literature on learning from demonstration. However, much of the robotics work has focused on generating direct functional mappings for low-level tasks. Alternatively, one can assume a rational model of the demonstrator and use the observed data to invert that model. This process can be loosely termed inverse decision making, and in practice it is often more challenging (both conceptually and computationally) than more direct mapping approaches. However, inverting the decision-making process may lend more insight into the motivation of the demonstrator, and provide a richer explanation of the observed actions. Indeed, similar methodology has been increasingly used in psychology and cognitive science for action understanding and preference learning in humans.

If the problem is formally cast in the Markov decision process (MDP) framework, the rational model described above becomes an agent that attempts to maximize cumulative reward (in a potentially sub-optimal fashion). Inverse decision making then becomes the problem of finding a state reward function that explains the observed state-action pairs of the agent, and is termed inverse reinforcement learning (IRL).
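
For reference, the forward and inverse problems can be stated as follows (the notation is generic and illustrative, not taken from any particular reference):
\[
\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty}\gamma^{t}R(s_t)\;\Big|\;\pi\Big],
\qquad
O=\{(s_i,a_i)\}_{i=1}^{N}:\;\; a_i \;\approx\; \arg\max_{a\in A} Q^{*}_{R}(s_i,a),
\]
i.e., the forward problem fixes the reward $R$ and seeks an optimal policy $\pi^{*}$, while IRL fixes the observed state-action pairs $O$ and seeks a reward function $R$ under which the demonstrated actions are (approximately) optimal.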

Bayesian Nonparametric Inverse Reinforcement Learning -- The Bayesian nonparametric inverse reinforcement learning (BNIRL) algorithm addresses the scalability of IRL methods to larger problems. BNIRL automatically partitions the observed demonstrations and finds a simple reward function to explain each partition using a Bayesian nonparametric mixture model. Using simple reward functions (which can be interpreted as subgoals) for each partition eliminates the need to search over a large candidate reward space. Further, the number of partitions is left unconstrained and unknown a priori, so the algorithm can still explain complex behavior. As a result, the algorithm is able to explain cyclic and repetitive demonstrations that break the assumptions of other IRL methods. A rough sketch of the partitioning step is given below.
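
To illustrate the partitioning idea only: the sketch below assigns demonstrated states to partitions with a Chinese-restaurant-process prior and scores each state against each partition's subgoal. The function names, the Gaussian distance-based likelihood, and the way candidate subgoals are drawn are illustrative assumptions for brevity, not the published BNIRL algorithm (whose likelihood is based on action optimality with respect to the subgoal reward).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def subgoal_likelihood(obs, subgoal, sigma=1.0):
    # Illustrative likelihood: a demonstrated state is explained by a partition
    # whose "simple reward" is a Gaussian bump at `subgoal` (a stand-in for the
    # action-based likelihood, used only to keep the sketch short).
    return np.exp(-np.sum((obs - subgoal) ** 2) / (2.0 * sigma ** 2))

def gibbs_sweep(observations, subgoals, z, alpha=1.0):
    # One Gibbs sweep over partition assignments z[i], with a
    # Chinese-restaurant-process prior (concentration alpha).
    n = len(observations)
    for i in range(n):
        counts = np.bincount(np.delete(z, i), minlength=len(subgoals)).astype(float)
        prior = np.append(counts, alpha)           # existing partitions + new partition
        candidate = observations[rng.integers(n)]  # candidate subgoal drawn from the demo
        lik = np.array([subgoal_likelihood(observations[i], g) for g in subgoals]
                       + [subgoal_likelihood(observations[i], candidate)])
        probs = prior * lik
        probs = probs / probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(subgoals):                     # open a new partition
            subgoals.append(candidate)
        z[i] = k
    return z, subgoals

# Toy usage: 20 two-dimensional demonstrated states, one initial subgoal.
obs = rng.normal(size=(20, 2))
z = np.zeros(20, dtype=int)
subgoals = [obs[0].copy()]
for _ in range(10):
    z, subgoals = gibbs_sweep(obs, subgoals, z)
print(len(subgoals), "partitions, assignments:", z)
\end{verbatim}
The key point the sketch conveys is that the number of partitions is not fixed in advance: a new partition (and its subgoal) can be opened whenever an observation is poorly explained by the existing ones.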

Scalable Reward Learning -- A Framework for Scalable Reward Learning offers several methods that avoid computing the full optimal MDP value function, enabling BNIRL to scale to much larger problem domains. In the first method, we modify BNIRL to use a framework known as Real-time Dynamic Programming (RTDP) \cite{Barto1995}. RTDP limits computation of the value function to the necessary areas of the state space only. This allows the complexity of the BNIRL reward learning method to scale with the size of the demonstration set, {\it not} the size of the full state space as in \cite{Michini2012a}. Experimental results for a Grid World domain show order-of-magnitude speedups over IRL methods that use exact solvers, for large grid sizes. A simplified sketch of the RTDP idea is given below.
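
The flavor of the RTDP modification can be conveyed with a short sketch of the generic RTDP idea from \cite{Barto1995} (deterministic transitions and an optimistic heuristic are assumed here for brevity; this is not the code used in the experiments): value backups are performed only along greedy rollouts started from states of interest, so most of the state space is never expanded.
\begin{verbatim}
def rtdp(start_states, actions, transition, reward, heuristic,
         gamma=0.95, n_trials=200, horizon=50, terminal=lambda s: False):
    # Real-Time Dynamic Programming (simplified, deterministic-transition sketch).
    # Bellman backups happen only at states visited by greedy rollouts from
    # `start_states`; unvisited states keep their optimistic heuristic value.
    V = {}                                        # lazily populated value table
    value = lambda s: V.get(s, heuristic(s))
    q = lambda s, a: reward(s, a) + gamma * value(transition(s, a))
    for trial in range(n_trials):
        s = start_states[trial % len(start_states)]
        for _ in range(horizon):
            if terminal(s):
                V[s] = 0.0
                break
            qs = {a: q(s, a) for a in actions}
            best = max(qs, key=qs.get)
            V[s] = qs[best]                       # backup at the visited state only
            s = transition(s, best)
    return V

# Toy usage: a 1-D chain of states 0..10 with a rewarded goal at state 10.
step = lambda s, a: min(max(s + a, 0), 10)
rew = lambda s, a: 1.0 if step(s, a) == 10 else 0.0
V = rtdp(start_states=[0, 3], actions=[-1, +1], transition=step, reward=rew,
         heuristic=lambda s: 1.0 / (1 - 0.95),    # optimistic upper bound on return
         terminal=lambda s: s == 10)
print(sorted(V.items()))
\end{verbatim}
Seeding the rollouts from the demonstrated states is what lets the cost of reward evaluation track the size of the demonstration set rather than the size of the full state space.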

In the second method, we modify BNIRL to utilize an existing closed-loop controller in place of the optimal value function. This avoids having to specify a discretization of the state or action spaces, extending the applicability of BNIRL to continuous demonstration domains. Experimental results are given for a quadrotor flight example that, if discretized, would require over $10^{10}$ states. In the experiment, quadrotor flight maneuvers are learned from a human demonstrator using only hand motions. The demonstration is recorded using a motion capture system and then analyzed by the BNIRL algorithm. Learned subgoal rewards (in the form of waypoints) are passed as commands to an autonomous quadrotor, which executes the learned behavior in actual flight. The entire process from demonstration to reward learning to robotic execution takes on the order of 10 seconds using a single computer. Thus, the results highlight the ability of BNIRL to use data from a safe (and not necessarily dynamically feasible) demonstration environment and quickly learn subgoal rewards that can be used on the actual robotic system.
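
One way to picture the controller-based likelihood is sketched below. The controller interface, the proportional waypoint-tracking controller, and the exponential scoring are illustrative assumptions for exposition, not the exact formulation used: the demonstrated action is scored by how closely it matches what the existing closed-loop controller would command when driving toward a candidate subgoal, with no discretization or optimal value function required.
\begin{verbatim}
import numpy as np

def action_likelihood(state, demo_action, subgoal, controller, beta=5.0):
    # Score a demonstrated action against a candidate subgoal by comparing it
    # to the action an existing closed-loop controller would take toward that
    # subgoal (replacing the Q-values that an optimal value function would give).
    commanded = controller(state, subgoal)
    err = np.linalg.norm(np.asarray(demo_action) - np.asarray(commanded))
    return np.exp(-beta * err)              # soft preference for close matches

# Toy usage: a proportional position controller driving toward a waypoint.
controller = lambda state, waypoint: 0.5 * (np.asarray(waypoint) - np.asarray(state))
state = np.array([0.0, 0.0, 1.0])           # e.g. a recorded hand position
demo_action = np.array([0.4, 0.0, 0.0])     # observed motion direction
print(action_likelihood(state, demo_action, np.array([1.0, 0.0, 1.0]), controller))
print(action_likelihood(state, demo_action, np.array([0.0, 1.0, 2.0]), controller))
\end{verbatim}
In the toy example, the waypoint lying in the direction of the observed motion receives a much higher score, which is the behavior needed for subgoal (waypoint) rewards to be inferred directly from continuous motion-capture data.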