Abstract

A rich representation is key to general robotic manipulation, but existing approaches to learning one require a lot of multimodal demonstrations. In this work we propose PLEX, a transformer-based architecture that learns from a small number of task-agnostic visuomotor trajectories accompanied by a much larger number of task-conditioned object manipulation videos, a type of data that is available in quantity. The key insight behind PLEX is that visuomotor trajectories help induce a latent feature space and train a robot to execute task-agnostic manipulation routines, while diverse video-only demonstrations can efficiently teach the robot how to plan in this feature space for a wide variety of tasks. In contrast to most works on robotic manipulation pretraining, PLEX learns a generalizable sensorimotor multi-task policy, not just an observational representation. We also show that using relative positional encoding in PLEX’s transformers greatly helps in low-data regimes when learning from human-collected demonstrations. Experiments showcase PLEX’s generalization on the Meta-World-v2 benchmark and state-of-the-art performance in challenging Robosuite environments.

overview

PLEX hyperparameter tuning experiments

overview

Meta-World play-data videos

bin-picking-v2
box-close-v2
door-lock-v2
door-unlock-v2
hand-insert-v2

Tasks on a real WidowX250 robot

pick-and-place
push-into-sink
lift-pan