Abstract

Sim2real transfer for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties, including invariance to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method on Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe that they can solve unseen long-horizon manipulation tasks with up to 8 stages under significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively.



ManipGen Overview


We present ManipGen, which consists of 3 main components. Left: train 1000s of RL experts in simulation using PPO. Middle: distill single-task RL experts into generalist visuomotor policies via DAgger. Right: text-conditioned long-horizon manipulation via task decomposition (VLM), pose estimation and goal reaching (motion planning), and sim2real transfer of local policies.
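As a rough illustration of how these three components fit together at inference time, the Python sketch below chains a VLM planner, a detector, a motion planner, and a library of local policies. The interfaces (`vlm`, `detector`, `planner`, `skill_library`, `robot`, `camera`) and all method names are hypothetical placeholders for illustration, not the actual ManipGen API.

```python
# Minimal sketch of the ManipGen inference loop described above.
# Every component interface here is a hypothetical placeholder.

def run_manipgen(task_prompt, robot, camera, vlm, detector, planner,
                 skill_library, max_steps=200):
    """Execute a long-horizon task by chaining local policies."""
    # 1. VLM task decomposition: split the instruction into (skill, object)
    #    subtasks, e.g. [("open", "drawer"), ("pick", "cup"), ...].
    subtasks = vlm.decompose(task_prompt, camera.rgb())

    for skill_name, object_name in subtasks:
        # 2. Perception + motion planning: locate the target object and move
        #    the arm to a pre-manipulation pose near it (e.g. with a neural
        #    motion planner such as Neural MP).
        target_pose = detector.estimate_pose(object_name, camera.rgbd())
        planner.move_to(robot, target_pose)

        # 3. Local policy execution: the sim2real-transferred visuomotor
        #    policy acts only on local (wrist-camera) observations, which is
        #    what makes it invariant to absolute pose and global scene layout.
        policy = skill_library[skill_name]
        obs = robot.local_observation()
        for _ in range(max_steps):
            action = policy.act(obs)
            robot.step(action)
            obs = robot.local_observation()
            if policy.done(obs):
                break
```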


All videos play at 8x speed

ManipGen solves complex, long-horizon tasks in the real world entirely zero-shot, generalizing across objects, poses, environments and scene configurations

CabinetStore (4 stages): 80%

DrawerStore (6 stages): 60%

Cook (2 stages): 90%

Replace (4 stages): 80%

Tidy (8 stages): 60%

ManipGen can perform zero-shot manipulation in tight spaces and with clutter

Shelf object manipulation

Picking pepper from clutter

Picking the screwdriver around the books

Placing the large sunscreen bottle in the drawer

ManipGen can manipulate objects with entirely unseen geometries

Orange/Black clip

HDMI cable

Pliers

Scotch tape

ManipGen generalizes to manipulating deformable objects

Paper towel

Soft toy

Trashbag

Coffee and biscuit

ManipGen generalizes to unseen manipulation receptacles

Horizontal dishrack bars under bowl

Spirals on the stove

Picking from clutter and circular bowl surface

Horizontal bars under razor box

ManipGen trains local skills at scale in simulation that transfer to the real world

Pick

Place

Grasp Handle

Open

Close

We train on 3.5K+ objects from UnidexGrasp and 2.5K+ objects from PartNet


UnidexGrasp (pick/place)


PartNet (grasp handle/open/close)
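To make the training setup above concrete, here is a minimal sketch of how the skill-to-asset mapping could be organized, mirroring the counts above (3.5K+ UnidexGrasp objects for pick/place, 2.5K+ PartNet articulated objects for grasp handle/open/close). The structure and names are illustrative assumptions, not the actual ManipGen code.

```python
# Hypothetical skill-to-dataset mapping for local skill training.
# Counts follow the figures stated above; everything else is illustrative.

SKILL_ASSETS = {
    "pick":         {"dataset": "UnidexGrasp", "num_objects": 3500},
    "place":        {"dataset": "UnidexGrasp", "num_objects": 3500},
    "grasp_handle": {"dataset": "PartNet",     "num_objects": 2500},
    "open":         {"dataset": "PartNet",     "num_objects": 2500},
    "close":        {"dataset": "PartNet",     "num_objects": 2500},
}

if __name__ == "__main__":
    # One training job per skill: train 1000s of PPO experts over the object
    # set, then distill them into a generalist visuomotor policy via DAgger.
    for skill, spec in SKILL_ASSETS.items():
        print(f"{skill}: {spec['num_objects']}+ objects from {spec['dataset']}")
```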

Failure Cases (4x speed)

Current VLMs are not always good at spatial/physical reasoning. Here GPT-4o predicts the incorrect plan ordering, which results in a failure: predicting pick then open instead of open then pick.

Existing open-set detection methods can be brittle when given challenging prompts (from the VLM). Grounded SAM segments the tray instead of the spice box due to a failure in Grounding DINO.

In some rare cases, especially with more oblong objects, perception/motion planning failures cause Neural MP to collide with the environment during execution, dropping the object.

While our policies are trained to handle moderate amounts of clutter, tightly packed clutter is still a challenge for picking. Here, a local policy grasp failure results in task failure.

BibTeX

@article{dalal2024manipgen,
  title={Local Policies Enable Zero-shot Long-horizon Manipulation},
  author={Murtaza Dalal and Min Liu and Walter Talbott and Chen Chen and Deepak Pathak and Jian Zhang and Ruslan Salakhutdinov},
  journal={arXiv preprint arXiv:2410.22332},
  year={2024},
}