Hierarchical World Models as
Visual Whole-Body Humanoid Controllers

Nicklas Hansen¹,  Jyothir S V²,  Vlad Sobal²,  Yann LeCun²,³,
Xiaolong Wang¹*,  Hao Su¹*

¹UC San Diego,  ²NYU,  ³Meta AI
*Equal advising

Visual whole-body control for humanoids. We present Puppeteer, a hierarchical world model for whole-body humanoid control with visual observations. Our method produces natural and human-like motions without any reward design or skill primitives, and traverses challenging terrain.

Abstract

Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty. In this work, we explore highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives. Specifically, we propose a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards. Our approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
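The abstract describes a two-level design: a high-level agent maps visual observations to commands, and a low-level agent executes those commands on the humanoid. A minimal sketch of that interface is below; the class names, dimensions, and the random linear policies are illustrative assumptions only, standing in for the learned world models described in the paper.

```python
import numpy as np

class HighLevelAgent:
    """Hypothetical sketch: maps visual observations to abstract commands.
    In the actual method this is a learned world model; a random linear
    map is used here purely to illustrate the interface."""
    def __init__(self, obs_dim, cmd_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(cmd_dim, obs_dim)) * 0.1

    def act(self, visual_obs):
        return np.tanh(self.W @ visual_obs)  # bounded command vector

class LowLevelAgent:
    """Hypothetical sketch: maps (proprioceptive state, command) to actions,
    one per degree of freedom of the simulated humanoid."""
    def __init__(self, state_dim, cmd_dim, action_dim=56, seed=1):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(action_dim, state_dim + cmd_dim)) * 0.1

    def act(self, state, command):
        return np.tanh(self.W @ np.concatenate([state, command]))

# One step of the hierarchical control loop (sizes are illustrative,
# not taken from the paper):
obs_dim, state_dim, cmd_dim = 64, 151, 8
high = HighLevelAgent(obs_dim, cmd_dim)
low = LowLevelAgent(state_dim, cmd_dim)

visual_obs = np.zeros(obs_dim)
state = np.zeros(state_dim)
command = high.act(visual_obs)   # high level: observation -> command
action = low.act(state, command) # low level: (state, command) -> action
assert action.shape == (56,)     # one action per DoF of the 56-DoF humanoid
```

Both levels are trained with rewards; only the high-level agent sees visual observations, which keeps the low-level controller reusable across tasks.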

Qualitative results

Our method produces natural and human-like motions across a variety of visual whole-body humanoid control tasks. A strong baseline, TD-MPC2, achieves comparable performance in terms of reward, but produces unnatural behaviors.

Zero-shot generalization

We evaluate agents trained on gap lengths of 0.1m to 0.4m on unseen gap lengths of up to 1.2m. Our method achieves non-trivial performance across all unseen gap lengths.

Tracking results

Our low-level tracking agent is a single policy trained to track a total of 836 human Motion Capture (MoCap) clips, purely with RL and without any bells and whistles. The agent tracks a wide variety of motions retargeted to a 56-DoF simulated humanoid.
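MoCap tracking objectives of this kind typically reward the agent for keeping its pose close to the reference clip at each timestep. The sketch below shows one common form, an exponentiated pose error; the function name, the squared-error form, and the `sigma` scale are illustrative assumptions, not necessarily the paper's exact reward.

```python
import numpy as np

def tracking_reward(qpos, qpos_ref, sigma=1.0):
    """Exponentiated pose-tracking reward, a common choice in MoCap
    tracking; the paper's exact reward terms may differ (hypothetical
    sketch). Returns 1.0 for a perfect match, decaying toward 0 as the
    pose error grows."""
    err = np.sum((np.asarray(qpos) - np.asarray(qpos_ref)) ** 2)
    return float(np.exp(-err / (2.0 * sigma ** 2)))

# Perfect tracking yields the maximum reward; any deviation lowers it.
ref = np.zeros(56)  # 56-DoF reference pose from a retargeted MoCap frame
print(tracking_reward(ref, ref))        # -> 1.0
print(tracking_reward(ref + 0.1, ref))  # < 1.0
```

A dense per-step reward like this is what lets a single RL agent absorb hundreds of clips without per-clip reward engineering.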

Benchmarking

Our method learns highly performant RL policies across both state-based and visual whole-body control tasks, while producing natural and human-like motions that are broadly preferred by humans. SAC and DreamerV3 do not achieve meaningful performance on these tasks, neither as standalone algorithms nor as high-level controllers for our low-level tracking world model. TD-MPC2 achieves comparable performance in terms of reward, but produces unnatural behaviors.

Human preference in humanoid motions

Aggregate results from a user study (n=51) where participants are presented with pairs of motions generated by TD-MPC2 and our method, and are asked to provide their preference. Motions generated by our method are broadly preferred by humans.

Paper

Hierarchical World Models as Visual Whole-Body Humanoid Controllers
Nicklas Hansen, Jyothir S V, Vlad Sobal, Yann LeCun, Xiaolong Wang, Hao Su

arXiv preprint

View on arXiv

Citation

If you find our work useful, please consider citing the paper as follows:

@inproceedings{hansen2025hierarchical,
  title     = {Hierarchical World Models as Visual Whole-Body Humanoid Controllers},
  author    = {Nicklas Hansen and Jyothir S V and Vlad Sobal and Yann LeCun and Xiaolong Wang and Hao Su},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2025}
}