Learning Visual Feature-Based World Models via Residual Latent Action

May 8, 2026
Authors: Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias
cs.AI

Abstract

World models predict future transitions from observations and actions. Existing works predominantly focus on image generation. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose the *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first is a minimalist world action model with RLA that learns from actionless demonstration videos. The second is the first visual RL framework trained entirely inside a world model learned from offline videos only, using a video-aligned reward and no online interactions or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
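The abstract states that RLA-WM predicts latent-action values via flow matching. The sketch below illustrates the bare mechanics of conditional flow matching in a small latent space; it is not the paper's RLA-WM implementation, and the latent target `mu` and all function names are hypothetical. A point-mass target is used so that the optimal velocity field has a known closed form, making the construction checkable end to end.

```python
import numpy as np

# Illustrative flow-matching sketch (assumptions: toy 4-D latent space,
# point-mass "latent action" target mu; NOT the paper's RLA-WM).
# Conditional flow matching builds the interpolant
#   x_t = (1 - t) * x0 + t * mu
# and regresses a velocity model onto the target v = mu - x0. For a
# point-mass target the optimal field is v*(x, t) = (mu - x) / (1 - t).
rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0, 0.5, 3.0])  # hypothetical latent-action target

def fm_training_pair(x0, t):
    """Interpolant x_t and the regression target a model would be trained on."""
    xt = (1 - t) * x0 + t * mu
    return xt, mu - x0

def optimal_velocity(x, t):
    """Closed-form optimal velocity field for the point-mass target."""
    return (mu - x) / (1 - t)

# Consistency check: the per-sample training target equals the optimal
# field evaluated at the interpolant, for any noise sample and time.
x0 = rng.standard_normal(4)
xt, target = fm_training_pair(x0, 0.3)
assert np.allclose(target, optimal_velocity(xt, 0.3))

# Sampling: Euler-integrate dx/dt = v*(x, t) from noise at t=0 to t=1.
x = rng.standard_normal(4)
n = 50
for k in range(n):
    x = x + optimal_velocity(x, k / n) / n

print(np.round(x, 6))  # the final Euler step lands on mu for this field
```

The same recipe applies when the target is a distribution of latents rather than a point mass: one then trains a neural velocity model on `(x_t, t) -> mu - x0` pairs and integrates the learned field at inference time, which is what makes flow matching attractive for generative prediction in high-dimensional feature spaces.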