Marvin Zhang, Sergey Levine, Chelsea Finn.
paper / code
MEMO: Test Time Robustness via Adaptation and Augmentation.
Under review at ICML 2022.
While deep neural networks can attain good accuracy on
in-distribution test points, many applications require
robustness even in the face of unexpected perturbations in the
input, changes in the domain, or other sources of distribution
shift. We study the problem of test time robustification, i.e.,
using the test input to improve model robustness. Recent prior
works have proposed methods for test time adaptation; however, they
each introduce additional assumptions, such as access to multiple
test points, that prevent widespread adoption. In this
work, we aim to study and devise methods that make no
assumptions about the model training process and are broadly
applicable at test time. We propose a simple approach that can
be used in any test setting where the model is probabilistic and
adaptable: when presented with a test example, perform different
data augmentations on the data point, and then adapt (all of)
the model parameters by minimizing the entropy of the model's
average, or marginal, output distribution across the
augmentations. Intuitively, this objective encourages the model
to make the same prediction across different augmentations, thus
enforcing the invariances encoded in these augmentations, while
also maintaining confidence in its predictions. In our
experiments, we evaluate two baseline ResNet models, two robust
ResNet-50 models, and a robust vision transformer model, and we
demonstrate that this approach achieves accuracy gains of 1-8%
over standard model evaluation and also generally outperforms
prior augmentation and adaptation strategies. For the setting in
which only one test point is available, we achieve
state-of-the-art results on the ImageNet-C, ImageNet-R, and,
among ResNet-50 models, ImageNet-A distribution shift benchmarks.
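The marginal-entropy objective described above can be sketched in a few lines. This toy NumPy version shows only the quantity being minimized; the augmentation functions and the gradient-based adaptation step are omitted, and the logits are placeholders rather than any real model's outputs:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def marginal_entropy(aug_logits):
    """Entropy of the model's marginal output distribution.

    aug_logits: (num_augmentations, num_classes) array of logits,
    one row per augmented copy of the single test point.
    """
    probs = softmax(aug_logits)    # per-augmentation predictions
    marginal = probs.mean(axis=0)  # average over augmentations
    return -float(np.sum(marginal * np.log(marginal + 1e-12)))
```

Confident, consistent predictions across augmentations give a peaked marginal and hence low entropy, while disagreement flattens the marginal and raises it, which is exactly what a gradient step on this objective pushes against.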
Marvin Zhang*, Henrik Marklund*, Nikita Dhawan*, Abhishek Gupta, Sergey Levine, Chelsea Finn.
paper / website / code
Adaptive Risk Minimization: Learning to Adapt to Domain Shift.
A fundamental assumption of most machine learning algorithms is
that the training and test data are drawn from the same
underlying distribution. However, this assumption is violated in
almost all practical applications: machine learning systems are
regularly tested under distribution shift, due to changing
temporal correlations, atypical end users, or other factors. In
this work, we consider the problem setting of domain
generalization, where the training data are structured into
domains and there may be multiple test time shifts,
corresponding to new domains or domain distributions. Most prior
methods aim to learn a single robust model or invariant feature
space that performs well on all domains. In contrast, we aim to
learn models that adapt at test time to domain shift using
unlabeled test points. Our primary contribution is to introduce
the framework of adaptive risk minimization (ARM), in which
models are directly optimized for effective adaptation to shift
by learning to adapt on the training domains. Compared to prior
methods for robustness, invariance, and adaptation, ARM methods
provide performance gains of 1-4% test accuracy on a number of
image classification problems exhibiting domain shift.
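As an illustrative sketch of the contextual flavor of this idea (with hypothetical linear models standing in for the paper's learned networks), an adaptation model summarizes an unlabeled batch into a context vector, and the prediction model conditions each point on that context; meta-training would optimize both models end to end across training domains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the learned networks.
W_context = rng.normal(size=(4, 3))   # adaptation model: features -> context
W_pred = rng.normal(size=(4 + 3, 2))  # prediction model: [x, context] -> logits

def adapt_and_predict(batch_x):
    """ARM-style forward pass: summarize the unlabeled batch into a
    context, then predict each point conditioned on that context."""
    context = (batch_x @ W_context).mean(axis=0)  # batch summary
    ctx = np.broadcast_to(context, (batch_x.shape[0], 3))
    return np.concatenate([batch_x, ctx], axis=1) @ W_pred

batch = rng.normal(size=(8, 4))  # unlabeled points from one test domain
logits = adapt_and_predict(batch)
```

Because the context depends on the whole batch, the same point is classified differently depending on which domain's batch it arrives in; that batch-dependence is the adaptation mechanism being meta-trained.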
Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, et al.
paper / website / code
WILDS: A Benchmark of in-the-Wild Distribution Shifts.
Distribution shifts -- where the training distribution differs
from the test distribution -- can substantially degrade the
accuracy of machine learning (ML) systems deployed in the wild.
Despite their ubiquity in real-world deployments, these
distribution shifts are under-represented in the datasets widely
used in the ML community today. To address this gap, we present
WILDS, a curated benchmark of 10 datasets reflecting a diverse
range of distribution shifts that naturally arise in real-world
applications, such as shifts across hospitals for tumor
identification; across camera traps for wildlife monitoring; and
across time and location in satellite imaging and poverty
mapping. On each dataset, we show that standard training yields
substantially lower out-of-distribution than in-distribution
performance. This gap remains even with models trained by
existing methods for tackling distribution shifts, underscoring
the need for new methods for training models that are more
robust to the types of distribution shifts that arise in
practice. To facilitate method development, we provide an
open-source package that automates dataset loading, contains
default model architectures and hyperparameters, and standardizes evaluations.
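The benchmark's headline quantity is the gap between in-distribution and out-of-distribution accuracy. A minimal sketch, on hypothetical predictions rather than any WILDS dataset:

```python
import numpy as np

def accuracy(preds, labels):
    return float(np.mean(preds == labels))

# Hypothetical predictions from one model on two held-out splits.
id_labels = np.array([0, 1, 1, 0, 1, 0])
id_preds = np.array([0, 1, 1, 0, 1, 1])    # in-distribution: 5/6 correct
ood_labels = np.array([1, 0, 1, 1, 0, 0])
ood_preds = np.array([0, 0, 1, 0, 0, 1])   # out-of-distribution: 3/6 correct

# The gap that standard training leaves on every WILDS dataset.
gap = accuracy(id_preds, id_labels) - accuracy(ood_preds, ood_labels)
```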
Laura Smith, Nikita Dhawan, Marvin Zhang, Pieter Abbeel, Sergey Levine.
paper / website / blog / talk
AVID: Learning Multi-Stage Tasks via Pixel-Level Translation of Human Videos.
Robotic reinforcement learning (RL) holds the promise of
enabling robots to learn complex behaviors through experience.
However, realizing this promise for long-horizon tasks in the
real world requires mechanisms to reduce human burden in terms
of defining the task and scaffolding the learning process. In
this paper, we study how these challenges can be alleviated with
an automated robotic learning framework, in which multi-stage
tasks are defined simply by providing videos of a human
demonstrator and then learned autonomously by the robot from raw
image observations. A central challenge in imitating human
videos is the difference in appearance between the human and
robot, which typically requires manual correspondence. We
instead take an automated approach and perform pixel-level image
translation via CycleGAN to convert the human demonstration into
a video of a robot, which can then be used to construct a reward
function for a model-based RL algorithm. The robot then learns
the task one stage at a time, automatically learning how to
reset each stage to retry it multiple times without
human-provided resets. This makes the learning process largely
automatic, from intuitive task specification via a video to
automated training with minimal human intervention. We
demonstrate that our approach is capable of learning complex
tasks, such as operating a coffee machine, directly from raw
image observations, requiring only 20 minutes to provide human
demonstrations and about 180 minutes of robot interaction.
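As a simplified stand-in for the reward construction described above, one can score an observation by its distance to the translated demo frame that completes the current stage. This is illustrative only: the actual method trains classifiers on the translated frames, and the toy frames below are placeholders:

```python
import numpy as np

def stage_reward(obs, stage_goal_frame):
    """Toy reward: negative mean pixel error between the current
    observation and the translated demo frame ending this stage.
    (A simplified illustration, not the paper's learned reward.)"""
    return -float(np.mean((obs - stage_goal_frame) ** 2))

# Hypothetical 8x8 grayscale frames.
goal = np.zeros((8, 8))
near = goal + 0.1  # observation close to stage completion
far = goal + 0.9   # observation far from stage completion
```

Such a per-stage reward lets the model-based RL algorithm optimize one stage at a time before moving to the next frame in the translated video.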
Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine.
paper / website / code / blog / talk
When to Trust Your Model: Model-Based Policy Optimization.
Designing effective model-based reinforcement learning
algorithms is difficult because the ease of data generation must
be weighed against the bias of model-generated data. In this
paper, we study the role of model usage in policy optimization
both theoretically and empirically. We first formulate and
analyze a model-based reinforcement learning algorithm with a
guarantee of monotonic improvement at each step. In practice,
this analysis is overly pessimistic and suggests that real
off-policy data is always preferable to model-generated
on-policy data, but we show that an empirical estimate of model
generalization can be incorporated into such analysis to justify
model usage. Motivated by this analysis, we then demonstrate
that a simple procedure of using short model-generated rollouts
branched from real data has the benefits of more complicated
model-based algorithms without the usual pitfalls. In
particular, this approach surpasses the sample efficiency of
prior model-based methods, matches the asymptotic performance of
the best model-free algorithms, and scales to horizons that
cause other model-based methods to fail entirely.
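The branched-rollout procedure can be sketched directly: sample start states from the real replay buffer, then roll the learned model forward only k steps from each. The model and policy below are hypothetical stand-ins, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_model(state, action):
    # stand-in for the learned dynamics model
    return state + 0.1 * action

def fake_policy(state):
    # stand-in for the current policy
    return rng.normal(size=state.shape)

def branched_rollouts(real_states, k, num_branches):
    """MBPO-style data generation: branch short k-step model rollouts
    from states sampled out of the real replay buffer."""
    model_data = []
    starts = real_states[rng.integers(len(real_states), size=num_branches)]
    for s in starts:
        for _ in range(k):
            a = fake_policy(s)
            s_next = fake_model(s, a)
            model_data.append((s, a, s_next))
            s = s_next
    return model_data

real_buffer = rng.normal(size=(100, 3))  # states from real interaction
synthetic = branched_rollouts(real_buffer, k=5, num_branches=10)
```

Keeping k small bounds how far model error can compound, which is the sense in which short branches sidestep the usual pitfalls of long model-generated trajectories.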
Marvin Zhang*, Sharad Vikram*, Laura Smith, Pieter Abbeel, Matthew Johnson, Sergey Levine.
paper / website / code / blog / talk
SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning.
Model-based reinforcement learning (RL) has proven to be a
data-efficient approach for learning control tasks but is difficult
to utilize in domains with complex observations such as images.
In this paper, we present a method for learning representations
that are suitable for iterative model-based policy improvement,
even when the underlying dynamical system has complex dynamics
and image observations, in that these representations are
optimized for inferring simple dynamics and cost models given
data from the current policy. This enables a model-based RL
method based on the linear-quadratic regulator (LQR) to be used
for systems with image observations. We evaluate our approach on
a range of robotics tasks, including manipulation with a
real-world robotic arm directly from images. We find that our
method produces substantially better final performance than
other model-based RL methods while being significantly more
efficient than model-free RL.
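The LQR computation at the heart of this approach can be sketched with the standard backward Riccati recursion. The double-integrator dynamics and cost matrices below are illustrative placeholders, not the latent models the method actually infers:

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Finite-horizon LQR via the backward Riccati recursion: the kind
    of control computation run on inferred linear dynamics
    x' = Ax + Bu with quadratic cost x'Qx + u'Ru."""
    P = Q
    gains = []
    for _ in range(horizon):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A + B @ K)
        gains.append(K)
    return gains[::-1]  # time-ordered feedback gains u_t = K_t x_t

# Hypothetical double-integrator example.
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = np.eye(2)
R = np.array([[0.1]])
Ks = lqr_gains(A, B, Q, R, horizon=50)
```

Because this solver only needs linear dynamics and quadratic costs, learning representations in which those simple models fit well is what makes it usable from images.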
Yevgen Chebotar*, Karol Hausman*, Marvin Zhang*, Gaurav Sukhatme, Stefan Schaal, Sergey Levine.
paper / website / code
Combining Model-Based and Model-Free Updates for Trajectory-Centric Reinforcement Learning.
Reinforcement learning (RL) algorithms for real-world robotic
applications need a data-efficient learning process and the
ability to handle complex, unknown dynamical systems. These
requirements are handled well by model-based and model-free RL
approaches, respectively. In this work, we aim to combine the
advantages of these two types of methods in a principled manner.
By focusing on time-varying linear-Gaussian policies, we enable
a model-based algorithm based on the linear quadratic regulator
(LQR) that can be integrated into the model-free framework of
path integral policy improvement (PI2). We can further combine
our method with guided policy search (GPS) to train arbitrary
parameterized policies such as deep neural networks. Our
simulation and real-world experiments demonstrate that this
method can solve challenging manipulation tasks with comparable
or better performance than model-free methods while maintaining
the sample efficiency of model-based methods.
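The PI2 half of the combination can be illustrated by its characteristic update: average sampled parameters with exponentiated-cost (softmax) weights so that low-cost samples dominate. This is a simplified sketch; the temperature and the sampled values are placeholders:

```python
import numpy as np

def pi2_update(param_samples, costs, temperature=1.0):
    """PI2-style model-free update: weight each sampled parameter
    vector by exp(-cost / temperature), normalized to sum to one."""
    costs = np.asarray(costs, dtype=float)
    z = -(costs - costs.min()) / temperature  # shift for stability
    w = np.exp(z)
    w /= w.sum()
    return (w[:, None] * np.asarray(param_samples, dtype=float)).sum(axis=0)

samples = np.array([[0.0], [1.0], [2.0]])
costs = [5.0, 1.0, 5.0]  # the middle sample is cheapest
new_params = pi2_update(samples, costs)
```

Because the update uses only sampled costs, it needs no dynamics model, which is what lets a model-based LQR step be folded into the same framework.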
Marvin Zhang*, Xinyang Geng*, Jonathan Bruce*, Ken Caluwaerts, Massimo Vespignani, Vytas SunSpiral, Pieter Abbeel, Sergey Levine.
paper / website / code
Deep Reinforcement Learning for Tensegrity Robot Locomotion.
Tensegrity robots, composed of rigid rods connected by elastic
cables, have a number of unique properties that make them
appealing for use as planetary exploration rovers. However,
control of tensegrity robots remains a difficult problem due to
their unusual structures and complex dynamics. In this work, we
show how locomotion gaits can be learned automatically using a
novel extension of mirror descent guided policy search (MDGPS)
applied to periodic locomotion movements, and we demonstrate the
effectiveness of our approach on tensegrity robot locomotion. We
evaluate our method with real-world and simulated experiments on
the SUPERball tensegrity robot, showing that the learned
policies generalize to changes in system parameters, unreliable
sensor measurements, and variation in environmental conditions,
including varied terrains and a range of different gravities.
Our experiments demonstrate that our method not only learns
fast, power-efficient feedback policies for rolling gaits, but
that these policies can succeed with only the limited onboard
sensing provided by SUPERball's accelerometers. We compare the
learned feedback policies to learned open-loop policies and
hand-engineered controllers, and demonstrate that the learned
policy enables the first continuous, reliable locomotion gait
for the real SUPERball robot.
Marvin Zhang, Zoe McCarthy, Chelsea Finn, Sergey Levine, Pieter Abbeel.
Learning Deep Neural Network Policies with Continuous Memory States.
Policy learning for partially observed control tasks requires
policies that can remember salient information from past
observations. In this paper, we present a method for learning
policies with internal memory for high-dimensional, continuous
systems, such as robotic manipulators. Our approach consists of
augmenting the state and action space of the system with
continuous-valued memory states that the policy can read from
and write to. Learning general-purpose policies with this type
of memory representation directly is difficult, because the
policy must automatically figure out the most salient
information to memorize at each time step. We show that, by
decomposing this policy search problem into a trajectory
optimization phase and a supervised learning phase through a
method called guided policy search, we can acquire policies with
effective memorization and recall strategies. Intuitively, the
trajectory optimization phase chooses the values of the memory
states that will make it easier for the policy to produce the
right action in future states, while the supervised learning
phase encourages the policy to use memorization actions to
produce those memory states. We evaluate our method on tasks
involving continuous control in manipulation and navigation
settings, and show that our method can learn complex policies
that successfully complete a range of tasks that require memory.
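The memory-augmentation idea above can be sketched in isolation: the policy reads the physical state together with the memory state, and its action includes both a control and a write to memory. A hand-coded toy policy stands in for the learned one here, showing how a write at one step enables recall later:

```python
import numpy as np

def step_with_memory(state, memory, policy):
    """One step of a memory-augmented system: the policy sees the
    state concatenated with the memory, and outputs a control
    together with the next memory contents."""
    control, memory_write = policy(np.concatenate([state, memory]))
    return control, memory_write  # memory_write becomes next memory

# Toy recall task: remember the first observation, act on it later.
def policy(aug_state):
    state, memory = aug_state[:1], aug_state[1:]
    if memory[0] == 0.0:              # empty memory (flagged by 0):
        return np.zeros(1), state.copy()  # write the observation
    return memory.copy(), memory      # otherwise act on the recall

obs0 = np.array([3.0])
mem = np.zeros(1)
_, mem = step_with_memory(obs0, mem, policy)                 # memorize
action, _ = step_with_memory(np.array([0.0]), mem, policy)   # recall
```

In the method itself, trajectory optimization picks useful memory-state values and supervised learning trains the policy to produce them, rather than relying on a hand-coded rule as in this sketch.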