Imitating Task and Motion Planning with Visuomotor Transformers

Published at Conference on Robot Learning (CoRL) 2023
Murtaza Dalal¹    Ajay Mandlekar*²    Caelan Garrett*²    Ankur Handa²    Ruslan Salakhutdinov¹    Dieter Fox²
¹Carnegie Mellon University          ²NVIDIA
*Denotes equal contribution


Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly, as they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations. In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation. To this end, we present a novel imitation learning system called Optimus that trains large-scale visuomotor Transformer policies by imitating a TAMP agent. Optimus introduces a pipeline for generating TAMP data that is specifically curated for imitation learning and can be used to train performant Transformer-based policies. We demonstrate that Optimus can solve a wide variety of challenging vision-based manipulation tasks with over 70 different objects, ranging from long-horizon and pick-and-place tasks to shelf and articulated object manipulation, achieving 70 to 80% success rates.

Offline Pretrained TAMP Imitation System

First, we generate a variety of tasks with differing initial configurations and goals. Next, we convert the TAMP joint-space demonstrations to task space, replace TAMP's privileged scene knowledge with visual observations, and prune TAMP demonstrations based on workspace constraints. Finally, we perform large-scale behavior cloning using a Transformer-based architecture and execute the resulting visuomotor policies.
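The joint-space-to-task-space conversion and the workspace-based pruning described above can be illustrated with a minimal sketch. It is not the Optimus implementation: the planar two-link arm, the link lengths, the workspace bounds, and every function name below are hypothetical and chosen only to keep the example self-contained.

```python
import math
from typing import List, Sequence, Tuple

# Assumption: a toy planar two-link arm stands in for the real robot.
LINK_LENGTHS = (0.5, 0.4)  # hypothetical link lengths in meters

Point = Tuple[float, float]


def forward_kinematics(joints: Sequence[float]) -> Point:
    """Map two joint angles (radians) to the (x, y) end-effector position."""
    q1, q2 = joints
    l1, l2 = LINK_LENGTHS
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return (x, y)


def to_task_space(joint_demo: Sequence[Sequence[float]]) -> List[Point]:
    """Convert a joint-space demonstration into end-effector positions."""
    return [forward_kinematics(q) for q in joint_demo]


def within_workspace(positions: Sequence[Point], lower: Point, upper: Point) -> bool:
    """Return True if every position lies inside the axis-aligned bounds."""
    return all(
        lower[i] <= p[i] <= upper[i] for p in positions for i in range(2)
    )


def prune_demos(
    joint_demos: Sequence[Sequence[Sequence[float]]],
    lower: Point,
    upper: Point,
) -> List[List[Point]]:
    """Keep only demonstrations whose task-space path stays in the workspace."""
    kept = []
    for demo in joint_demos:
        positions = to_task_space(demo)
        if within_workspace(positions, lower, upper):
            kept.append(positions)
    return kept


# Hypothetical workspace bounds and two toy demonstrations.
lower, upper = (-0.1, -0.1), (1.0, 1.0)
demo_inside = [(0.0, 0.0), (math.pi / 2, 0.0)]  # stays within the bounds
demo_outside = [(math.pi, 0.0)]                 # reaches x = -0.9, so it is pruned
kept = prune_demos([demo_inside, demo_outside], lower, upper)
```

In this toy setup only `demo_inside` survives pruning. A real pipeline would use the robot's full kinematic model and 6-DoF end-effector poses rather than planar positions.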

Optimus enables visuomotor policies to solve manipulation tasks with up to 8 stages.

Data Collection Results

Optimus can solve tasks requiring obstacle awareness and skills beyond pick-and-place.

Optimus can distill TAMP's task planning and scene generalization capabilities.

Large Scale Evaluation

Optimus produces TAMP-supervised visuomotor policies that can solve over 300 manipulation tasks with up to 72 different objects, achieving success rates of over 70%.


    title={Imitating Task and Motion Planning with Visuomotor Transformers},
    author={Dalal, Murtaza and Mandlekar, Ajay and Garrett, Caelan and Handa, Ankur and Salakhutdinov, Ruslan and Fox, Dieter},
    journal={Conference on Robot Learning},


We thank Russell Mendonca, Sudeep Dasari, Theophile Gervet and Brandon Trabucco for feedback on early drafts of this paper. This work was supported in part by ONR N000141812861, ONR N000142312368 and DARPA/AFRL FA87502321015. Additionally, MD is supported by the NSF Graduate Fellowship.