Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, Sergey Levine.
paper / website / blog / talk
AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos.
Robotic reinforcement learning (RL) holds the promise of
enabling robots to learn complex behaviors through experience.
However, realizing this promise for long-horizon tasks in the
real world requires mechanisms to reduce human burden in terms
of defining the task and scaffolding the learning process. In
this paper, we study how these challenges can be alleviated with
an automated robotic learning framework, in which multi-stage
tasks are defined simply by providing videos of a human
demonstrator and then learned autonomously by the robot from raw
image observations. A central challenge in imitating human
videos is the difference in appearance between the human and
robot, which typically requires manual correspondence. We
instead take an automated approach and perform pixel-level image
translation via CycleGAN to convert the human demonstration into
a video of a robot, which can then be used to construct a reward
function for a model-based RL algorithm. The robot then learns
the task one stage at a time, automatically learning how to
reset each stage to retry it multiple times without
human-provided resets. This makes the learning process largely
automatic, from intuitive task specification via a video to
automated training with minimal human intervention. We
demonstrate that our approach is capable of learning complex
tasks, such as operating a coffee machine, directly from raw
image observations, requiring only 20 minutes to provide human
demonstrations and about 180 minutes of robot interaction.
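The core idea — translate the human demonstration into the robot's visual domain, then score the robot's observations against the translated frames, stage by stage — can be sketched as below. Here `translate` stands in for a trained CycleGAN generator (human-to-robot) and `encode` for any image embedding; both names, and the simple distance-based reward, are illustrative assumptions rather than the paper's exact components.

```python
import numpy as np

def translated_demo_reward(human_frames, translate, encode, robot_obs, stage_idx):
    """Reward from a human video translated into robot-domain frames.

    The goal image for the current stage is the translated human frame;
    reward is the negative embedding distance to that goal.
    """
    goal = encode(translate(human_frames[stage_idx]))
    return -np.linalg.norm(encode(robot_obs) - goal)

def stagewise_success(reward_fn, obs_seq, n_stages, threshold=-0.1):
    """Advance to the next stage once the current stage's reward crosses
    a threshold, mirroring the stage-at-a-time learning described above."""
    stage = 0
    for obs in obs_seq:
        if stage < n_stages and reward_fn(obs, stage) >= threshold:
            stage += 1
    return stage
```

In the actual system the per-stage reward comes from the model-based RL algorithm's objective rather than a raw pixel distance, but the stage-gating structure is the same.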
Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine.
paper / website / code / blog / talk
When to Trust Your Model: Model-Based Policy Optimization.
Designing effective model-based reinforcement learning
algorithms is difficult because the ease of data generation must
be weighed against the bias of model-generated data. In this
paper, we study the role of model usage in policy optimization
both theoretically and empirically. We first formulate and
analyze a model-based reinforcement learning algorithm with a
guarantee of monotonic improvement at each step. In practice,
this analysis is overly pessimistic and suggests that real
off-policy data is always preferable to model-generated
on-policy data, but we show that an empirical estimate of model
generalization can be incorporated into such analysis to justify
model usage. Motivated by this analysis, we then demonstrate
that a simple procedure of using short model-generated rollouts
branched from real data has the benefits of more complicated
model-based algorithms without the usual pitfalls. In
particular, this approach surpasses the sample efficiency of
prior model-based methods, matches the asymptotic performance of
the best model-free algorithms, and scales to horizons that
cause other model-based methods to fail entirely.
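The branched-rollout procedure described above is simple enough to sketch directly. In this minimal version, `model` and `policy` are assumed callable interfaces (the model returns a next state and reward), not the paper's actual API; the key structural points are that every rollout starts from a *real* state and is truncated after `k` steps to bound compounding model error.

```python
import random

def mbpo_rollouts(model, policy, real_states, k, n_rollouts):
    """Generate short model-based rollouts branched from real states."""
    synthetic = []
    for _ in range(n_rollouts):
        s = random.choice(real_states)   # branch from real, off-policy data
        for _ in range(k):               # short horizon limits model bias
            a = policy(s)
            s_next, r = model(s, a)
            synthetic.append((s, a, r, s_next))
            s = s_next
    return synthetic
```

The synthetic transitions would then augment the real replay buffer for a model-free policy update.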
Marvin Zhang*, Sharad Vikram*, Laura Smith, Pieter Abbeel, Matthew Johnson, Sergey Levine.
paper / website / code / blog / talk
SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning.
Model-based reinforcement learning (RL) has proven to be a
data-efficient approach for learning control tasks but is difficult
to utilize in domains with complex observations such as images.
In this paper, we present a method for learning representations
that are suitable for iterative model-based policy improvement,
even when the underlying dynamical system has complex dynamics
and image observations, in that these representations are
optimized for inferring simple dynamics and cost models given
data from the current policy. This enables a model-based RL
method based on the linear-quadratic regulator (LQR) to be used
for systems with image observations. We evaluate our approach on
a range of robotics tasks, including manipulation with a
real-world robotic arm directly from images. We find that our
method produces substantially better final performance than
other model-based RL methods while being significantly more
efficient than model-free RL.
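A toy version of the model-based machinery this representation enables is sketched below: fit linear latent dynamics by least squares, then run a finite-horizon LQR backward pass on the fitted model. This is a simplification under stated assumptions — SOLAR infers structured latent dynamics jointly with the representation rather than fitting them post hoc — but it shows why simple dynamics in the latent space make LQR applicable.

```python
import numpy as np

def fit_linear_dynamics(Z, U, Z_next):
    """Least-squares fit of z' ~ A z + B u from latent transitions."""
    X = np.hstack([Z, U])
    W, *_ = np.linalg.lstsq(X, Z_next, rcond=None)
    A = W[:Z.shape[1]].T
    B = W[Z.shape[1]:].T
    return A, B

def lqr_gain(A, B, Q, R, horizon):
    """Finite-horizon LQR backward recursion on the fitted latent model.
    Returns the feedback gain K for the control law u = -K z."""
    P = Q
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K
```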
Yevgen Chebotar*, Karol Hausman*, Marvin Zhang*, Gaurav Sukhatme, Stefan Schaal, Sergey Levine.
paper / website / code
Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning.
Reinforcement learning (RL) algorithms for real-world robotic
applications need a data-efficient learning process and the
ability to handle complex, unknown dynamical systems. These
requirements are handled well by model-based and model-free RL
approaches, respectively. In this work, we aim to combine the
advantages of these two types of methods in a principled manner.
By focusing on time-varying linear-Gaussian policies, we enable
a model-based algorithm based on the linear-quadratic regulator
(LQR) that can be integrated into the model-free framework of
path integral policy improvement (PI2). We can further combine
our method with guided policy search (GPS) to train arbitrary
parameterized policies such as deep neural networks. Our
simulation and real-world experiments demonstrate that this
method can solve challenging manipulation tasks with comparable
or better performance than model-free methods while maintaining
the sample efficiency of model-based methods.
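The flavor of the combination can be sketched with a toy update rule: a PI2-style reward-weighted average of sampled actions, blended with a model-based LQR action. The fixed blend weight `alpha` here is an illustrative simplification — the actual method derives the split between the two update types from how well the model explains the observed costs — and all function names are stand-ins.

```python
import numpy as np

def pi2_update(actions, costs, temperature=1.0):
    """PI2-style update: exponentiated-cost weighted average of samples."""
    w = np.exp(-(costs - costs.min()) / temperature)
    w /= w.sum()
    return (w[:, None] * actions).sum(axis=0)

def combined_update(lqr_action, actions, costs, alpha=0.5):
    """Blend a model-based (LQR) action with a model-free (PI2) estimate."""
    return alpha * lqr_action + (1 - alpha) * pi2_update(actions, costs)
```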
Marvin Zhang*, Xinyang Geng*, Jonathan Bruce*, Ken Caluwaerts, Massimo Vespignani, Vytas SunSpiral, Pieter Abbeel, Sergey Levine.
paper / website / code
Deep Reinforcement Learning for Tensegrity Robot Locomotion.
Tensegrity robots, composed of rigid rods connected by elastic
cables, have a number of unique properties that make them
appealing for use as planetary exploration rovers. However,
control of tensegrity robots remains a difficult problem due to
their unusual structures and complex dynamics. In this work, we
show how locomotion gaits can be learned automatically using a
novel extension of mirror descent guided policy search (MDGPS)
applied to periodic locomotion movements, and we demonstrate the
effectiveness of our approach on tensegrity robot locomotion. We
evaluate our method with real-world and simulated experiments on
the SUPERball tensegrity robot, showing that the learned
policies generalize to changes in system parameters, unreliable
sensor measurements, and variation in environmental conditions,
including varied terrains and a range of different gravities.
Our experiments demonstrate that our method not only learns
fast, power-efficient feedback policies for rolling gaits, but
that these policies can succeed with only the limited onboard
sensing provided by SUPERball's accelerometers. We compare the
learned feedback policies to learned open-loop policies and
hand-engineered controllers, and demonstrate that the learned
policy enables the first continuous, reliable locomotion gait
for the real SUPERball robot.
Marvin Zhang, Zoe McCarthy, Chelsea Finn, Sergey Levine, Pieter Abbeel.
Learning Deep Neural Network Policies with Continuous Memory States.
Policy learning for partially observed control tasks requires
policies that can remember salient information from past
observations. In this paper, we present a method for learning
policies with internal memory for high-dimensional, continuous
systems, such as robotic manipulators. Our approach consists of
augmenting the state and action space of the system with
continuous-valued memory states that the policy can read from
and write to. Learning general-purpose policies with this type
of memory representation directly is difficult, because the
policy must automatically figure out the most salient
information to memorize at each time step. We show that, by
decomposing this policy search problem into a trajectory
optimization phase and a supervised learning phase through a
method called guided policy search, we can acquire policies with
effective memorization and recall strategies. Intuitively, the
trajectory optimization phase chooses the values of the memory
states that will make it easier for the policy to produce the
right action in future states, while the supervised learning
phase encourages the policy to use memorization actions to
produce those memory states. We evaluate our method on tasks
involving continuous control in manipulation and navigation
settings, and show that our method can learn complex policies
that successfully complete a range of tasks that require memory.
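The state-and-action augmentation at the heart of the method can be sketched in a few lines. Here `env_step` and `policy` are illustrative stand-ins: the policy reads the observation concatenated with the current memory and outputs both a physical action and a memory write, which persists to the next time step.

```python
import numpy as np

def step_with_memory(env_step, policy, obs, memory):
    """One control step with continuous-valued memory states.

    The observation is augmented with the memory before querying the
    policy; the policy's output is split into a physical action and a
    new memory value (the "memorization action").
    """
    m = memory.size
    augmented_obs = np.concatenate([obs, memory])
    out = policy(augmented_obs)
    action, memory_write = out[:-m], out[-m:]
    next_obs = env_step(action)
    return next_obs, memory_write
```

Rolling this step function forward carries the memory alongside the physical state, which is what lets guided policy search optimize what the policy writes at each step.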