We present TD-MPC, a framework for model predictive control (MPC) that uses a Task-Oriented Latent Dynamics (TOLD) model and a terminal value function, both learned jointly by temporal difference learning. Our method compares favorably to prior model-free and model-based methods, and solves the high-dimensional Humanoid and Dog locomotion tasks within 1M environment steps. To the best of our knowledge, this is the first documented result solving the challenging Dog tasks.
Abstract
Data-driven model predictive control has two key advantages over model-free methods: a potential for improved sample efficiency through model learning, and better performance as the computational budget for planning increases. However, it is both costly to plan over long horizons and challenging to obtain an accurate model of the environment. In this work, we combine the strengths of model-free and model-based methods. We use a task-oriented latent dynamics model for local trajectory optimization over a short horizon, and a terminal value function to estimate the return beyond that horizon; both are learned jointly by temporal difference learning. Our method, TD-MPC, achieves superior sample efficiency and asymptotic performance over prior work on both state- and image-based continuous control tasks from DMControl and Meta-World.
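To make the planning objective concrete, the sketch below scores a single candidate action sequence the way the abstract describes: predicted rewards are summed over a short latent-space rollout, and a learned terminal value bootstraps the return beyond the horizon. This is a minimal illustration rather than the official implementation; `encoder`, `dynamics`, `reward`, and `value` stand in for the learned TOLD components (in the paper, the terminal value is the Q-function evaluated at a policy action).

```python
# Minimal sketch of trajectory scoring in TD-MPC, assuming `encoder`,
# `dynamics`, `reward`, and `value` are learned networks that return
# latent states and scalar tensors; names and signatures are illustrative.
import torch

def estimate_return(encoder, dynamics, reward, value, obs, actions, gamma=0.99):
    """Score one candidate action sequence: predicted rewards summed over a
    short horizon, plus a discounted terminal value estimate."""
    z = encoder(obs)                             # encode observation to latent
    total, discount = 0.0, 1.0
    for a in actions:                            # actions: H planned actions
        total = total + discount * reward(z, a)  # predicted one-step reward
        z = dynamics(z, a)                       # latent transition
        discount *= gamma
    return total + discount * value(z)           # value covers the remaining return
```

Because the terminal value accounts for everything beyond the horizon, planning can use a short rollout without becoming myopic, which keeps trajectory optimization cheap.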
DMControl
Meta-World
Multi-modal RL
Planning with TD-MPC
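As a rough sketch of how such a planner can be structured, the loop below implements MPPI-style optimization over action sequences, scoring candidates with the `estimate_return` helper from the sketch above. All hyperparameter values here are illustrative, and details from the paper (warm-starting from the previous solution and mixing in samples from a learned policy) are omitted.

```python
# MPPI-style planning loop; a sketch assuming the `estimate_return` helper
# above and actions normalized to [-1, 1]. Hyperparameters are illustrative.
import torch

def plan(models, obs, action_dim, horizon=5, num_samples=512,
         num_elites=64, iterations=6, temperature=0.5):
    mean = torch.zeros(horizon, action_dim)    # Gaussian over action sequences
    std = torch.ones(horizon, action_dim)
    for _ in range(iterations):
        # Sample candidate action sequences around the current distribution.
        noise = torch.randn(num_samples, horizon, action_dim)
        candidates = (mean + std * noise).clamp(-1, 1)
        scores = torch.stack([estimate_return(*models, obs, seq)
                              for seq in candidates])
        # Re-fit the Gaussian to a softmax-weighted set of elite sequences.
        elite_idx = scores.topk(num_elites).indices
        elites, elite_scores = candidates[elite_idx], scores[elite_idx]
        weights = torch.softmax(elite_scores / temperature, dim=0)
        mean = (weights[:, None, None] * elites).sum(0)
        var = (weights[:, None, None] * (elites - mean) ** 2).sum(0)
        std = var.sqrt().clamp_min(1e-3)       # keep some exploration noise
    return mean[0]  # receding horizon: execute the first action, then replan
```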
Training a Task-Oriented Latent Dynamics Model
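A single-step sketch of the joint objective is given below. The TOLD model is trained so that (i) it predicts one-step rewards, (ii) its value function satisfies a temporal difference constraint, and (iii) its predicted next latent state stays consistent with the encoding of the observed next state. The paper applies this objective over a multi-step latent rollout with decaying per-step weights; the network names, target-network handling, and loss coefficients below are simplified.

```python
# Single-step sketch of a joint TOLD-style objective; coefficients and
# target-network handling are illustrative simplifications of the paper.
import torch
import torch.nn.functional as F

def told_loss(encoder, encoder_tgt, dynamics, reward, q, q_tgt, policy,
              s, a, r, s_next, gamma=0.99,
              c_rew=0.5, c_val=0.1, c_con=2.0):       # illustrative weights
    z = encoder(s)
    with torch.no_grad():
        z_next = encoder_tgt(s_next)                  # consistency target
        td_target = r + gamma * q_tgt(z_next, policy(z_next))  # TD target
    return (c_rew * F.mse_loss(reward(z, a), r)       # reward prediction
            + c_val * F.mse_loss(q(z, a), td_target)  # temporal difference
            + c_con * F.mse_loss(dynamics(z, a), z_next))  # latent consistency
```

Notably, there is no reconstruction term: the latent representation is shaped only by reward- and value-relevant signals, which is what makes the learned dynamics model task-oriented.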
Benchmark Results
Comparison to Related Work
Citation
If you use our method or code in your research, please consider citing the paper as follows:
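```bibtex
@inproceedings{hansen2022temporal,
  title={Temporal Difference Learning for Model Predictive Control},
  author={Nicklas Hansen and Xiaolong Wang and Hao Su},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2022}
}
```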