ManiSkill-HAB

A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks

Preprint
Hillbot Inc. and UC San Diego

Overview: ManiSkill-HAB provides a GPU-accelerated implementation of the Home Assistant Benchmark (HAB) with realistic low-level control, together with extensive RL and IL baselines and a rule-based trajectory filtering system that enables efficient, controlled data generation at scale.

Abstract

High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, and larger demonstration datasets. To this end, we present MS-HAB, a holistic benchmark for low-level manipulation and in-home object rearrangement. First, we provide a GPU-accelerated implementation of the Home Assistant Benchmark (HAB). We support realistic low-level control and achieve over 3× the speed of previous magical grasp implementations at similar GPU memory usage. Second, we train extensive reinforcement learning (RL) and imitation learning (IL) baselines for future work to compare against. Finally, we develop a rule-based trajectory filtering system to sample specific demonstrations from our RL policies which match predefined criteria for robot behavior and safety. Combining demonstration filtering with our fast environments enables efficient, controlled data generation at scale.

Parallelized GPU Simulation and Rendering

By scaling parallel environments with GPU simulation, MS-HAB achieves 4000 samples per second on a benchmark involving representative interaction with dynamic objects, 3× the throughput of Habitat 2.0's implementation at similar GPU memory usage.

MS-HAB environments support realistic low-level control for successful grasping, manipulation, and interaction, whereas the Habitat 2.0 environments do not support this kind of low-level control.

This means MS-HAB is fast enough to support online training and efficient, extensive evaluation without sacrificing physical realism.
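
For a rough idea of how parallel GPU-simulated environments are typically created and stepped, here is a minimal sketch following ManiSkill's gymnasium-style vectorized API. The module name mshab.envs and the environment ID "PickSubtaskTrain-v0" are illustrative assumptions, not the benchmark's exact identifiers; see the GitHub repository for the registered environment IDs and configuration options.

import gymnasium as gym

import mshab.envs  # assumed: importing registers the MS-HAB environments

env = gym.make(
    "PickSubtaskTrain-v0",  # hypothetical subtask environment ID
    num_envs=1024,          # parallel environments simulated on one GPU
    obs_mode="rgbd",        # RGBD cameras + proprioceptive state
    sim_backend="gpu",
)

obs, _ = env.reset(seed=0)
for _ in range(100):
    # Random batched actions stand in for a policy; real training would
    # query a network here. Returned arrays are batched over num_envs.
    actions = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(actions)
env.close()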

Extensive RL and IL Baselines with Whole-Body Control

To solve the HAB's long-horizon tasks (TidyHouse, PrepareGroceries, SetTable), MS-HAB chains individual skill policies (Pick, Place, Open, and Close). For each skill, MS-HAB provides extensive reinforcement learning (RL) and imitation learning (IL) baselines which use whole-body control, i.e. manipulation and navigation performed simultaneously.
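
The sketch below illustrates how per-skill policies can be chained to execute a long-horizon task. It is a simplified illustration rather than the benchmark's evaluation code: the policies dict, the policy.act(obs, target) interface, and the "subtask_success" info key are assumptions made for the example.

def run_long_horizon_task(env, subtask_plan, policies, max_steps_per_subtask=200):
    """Execute a long-horizon task by chaining per-skill policies.

    subtask_plan: ordered list of (skill, target) pairs, e.g.
        [("Pick", "bowl"), ("Place", "sink")].
    policies: dict mapping skill names ("Pick", "Place", "Open", "Close")
        to trained whole-body policies.
    """
    obs, info = env.reset()
    for skill, target in subtask_plan:
        policy = policies[skill]
        for _ in range(max_steps_per_subtask):
            # Whole-body action: base, torso, and arm commands in one vector.
            action = policy.act(obs, target)
            obs, reward, terminated, truncated, info = env.step(action)
            if info.get("subtask_success", False):
                break  # move on to the next subtask in the plan
    return info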




We find that, despite significant tuning, our baselines are unable to fully solve the MS-HAB tasks. In particular, individual subtask success rates (Pick, Place, Open, and Close) all leave significant room for improvement. This indicates that the benchmark is not yet saturated, leaving ample room for future work to improve performance.

Efficient, Controlled Data Generation at Scale

We develop a rule-based event labeling and trajectory categorization system to filter for specific demonstrations which match predefined criteria for robot behavior and safety. We provide these tools so users can generate data with custom requirements. We use this filtering system to generate a large vision-based robot dataset to train our IL policies.
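
As a minimal sketch of what such rule-based filtering can look like, assume each rollout is a dict with a success flag and a list of per-step event labels produced by the labeler. The event names ("dropped_object", "excessive_collision") and the criteria below are illustrative assumptions; the released tooling defines its own label set and thresholds.

def keep_trajectory(traj) -> bool:
    # Reject unsuccessful rollouts outright.
    if not traj["success"]:
        return False
    events = traj["events"]  # per-step event labels from the rule-based labeler
    # Reject rollouts that violate the (assumed) behavior/safety criteria.
    if "dropped_object" in events:
        return False
    if events.count("excessive_collision") > 0:
        return False
    return True

def filter_demonstrations(trajectories):
    """Keep only rollouts that satisfy the criteria above."""
    return [t for t in trajectories if keep_trajectory(t)]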

2× 128×128 RGBD images + state  |  1000 episodes per target object/articulation  |  Event labeling performed on all trajectories

Long-Horizon Task | Subtasks | Episodes | Transitions | Size | Link
TidyHouse | Pick (M*), Place (M~H) | 18K | 3.6M | 208.5 GB | Download
PrepareGroceries | Pick (H), Place (H) | 18K | 3.6M | 174.2 GB | Download
SetTable | Pick (E), Place (E), Open (E), Close (E) | 8K | 1.6M | 83.4 GB | Download
*Approximate subtask difficulty (based on randomizations, receptacles, etc.): Easy (E)  |  Medium (M)  |  Hard (H)
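
For a quick look at a downloaded demonstration file, the sketch below assumes the dataset follows a ManiSkill-style HDF5 trajectory layout (one "traj_<i>" group per episode containing action and observation arrays); the file name is a placeholder, and the actual group/key names may differ, so check the dataset documentation.

import h5py

with h5py.File("tidyhouse_pick_demos.h5", "r") as f:  # placeholder path
    episode = f["traj_0"]            # first recorded episode (assumed layout)
    actions = episode["actions"][:]  # (T, action_dim)
    print("episode length:", actions.shape[0])
    print("stored keys:", list(episode.keys()))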

Furthermore, we use the trajectory categorization system to group rollouts by success/failure modes. We provide these statistics in the paper appendix to give the community clearer insight into avenues for improvement beyond raw success rates.

Supporting Open Source Science

All environments, code, checkpoints, and datasets are open-sourced for the community to use. We will continue to support the MS-HAB benchmark with performance improvements, features, baselines, and tools. If you'd like to request a feature or contribute, please check out the GitHub repository!

Code  |  Models  |  Dataset
Whole-body low-level control under constraints in cluttered environments, long-horizon skill chaining, and scene-level rearrangement are challenging for current robot learning methods; we hope our benchmark and dataset aid the community in advancing these research areas.

Citation

If you use ManiSkill-HAB in your work, please consider citing the following:


@article{shukla2024maniskillhab,
  author     = {Arth Shukla and Stone Tao and Hao Su},
  title      = {ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks},
  journal    = {CoRR},
  volume     = {abs/2412.13211},
  year       = {2024},
  url        = {https://doi.org/10.48550/arXiv.2412.13211},
  doi        = {10.48550/ARXIV.2412.13211},
  eprinttype = {arXiv},
  eprint     = {2412.13211},
  timestamp  = {Mon, 09 Dec 2024 01:29:24 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2412-13211.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}