

Learning Visual Feature-Based World Models via Residual Latent Action

May 8, 2026
Authors: Xinyu Zhang, Zhengtong Xu, Yutian Tao, Yeping Wang, Yu She, Abdeslam Boularias
cs.AI

Abstract

World models predict future transitions from observations and actions. Existing work focuses predominantly on image generation. Visual feature-based world models, on the other hand, predict future visual features instead of raw video pixels, offering a promising alternative that is more efficient and less prone to hallucination. However, current feature-based approaches rely on direct regression, which leads to blurry or collapsed predictions in complex interactions, while generative modeling in high-dimensional feature spaces remains challenging. In this work, we discover that a new type of latent action representation, which we refer to as *Residual Latent Action* (RLA), can be easily learned from DINO residuals. We also show that RLA is predictive, generalizable, and encodes temporal progression. Building on RLA, we propose the *RLA World Model* (RLA-WM), which predicts RLA values via flow matching. RLA-WM outperforms both state-of-the-art feature-based and video-diffusion world models on simulation and real-world datasets, while being orders of magnitude faster than video diffusion. Furthermore, we develop two robot learning techniques that use RLA-WM to improve policy learning. The first is a minimalist world action model with RLA that learns from actionless demonstration videos. The second is the first visual RL framework trained entirely inside a world model learned from offline videos alone, using a video-aligned reward and requiring no online interaction or handcrafted rewards. Project page: https://mlzxy.github.io/rla-wm
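To make the two core ideas concrete, below is a minimal PyTorch-style sketch of how a residual latent action and its flow-matching predictor could be wired together. All names (`RLAEncoder`, `FlowHead`, `flow_matching_loss`) and dimensions are illustrative assumptions, not the paper's actual implementation; the objective shown is the standard rectified-flow form of flow matching, conditioned here on the current DINO feature.

```python
# Hypothetical sketch of RLA + flow-matching prediction; names and
# architectures are assumptions, not the paper's released code.
import torch
import torch.nn as nn

class RLAEncoder(nn.Module):
    """Encodes the residual between consecutive DINO features into a
    compact latent action (RLA)."""
    def __init__(self, feat_dim: int = 768, rla_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, rla_dim)
        )

    def forward(self, z_t: torch.Tensor, z_next: torch.Tensor) -> torch.Tensor:
        return self.net(z_next - z_t)  # DINO residual -> RLA

class FlowHead(nn.Module):
    """Velocity field v(x_tau, tau | z_t) for conditional flow matching
    over RLA vectors, conditioned on the current DINO feature z_t."""
    def __init__(self, rla_dim: int = 32, feat_dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rla_dim + feat_dim + 1, 256), nn.GELU(), nn.Linear(256, rla_dim)
        )

    def forward(self, x_tau: torch.Tensor, tau: torch.Tensor,
                z_t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_tau, z_t, tau], dim=-1))

def flow_matching_loss(flow: FlowHead, rla_target: torch.Tensor,
                       z_t: torch.Tensor) -> torch.Tensor:
    """Rectified-flow objective: regress the constant velocity of the
    straight path from noise to the target RLA."""
    noise = torch.randn_like(rla_target)
    tau = torch.rand(rla_target.size(0), 1)              # interpolation time in [0, 1)
    x_tau = (1 - tau) * noise + tau * rla_target         # linear interpolation path
    v_target = rla_target - noise                        # path velocity is constant
    return ((flow(x_tau, tau, z_t) - v_target) ** 2).mean()

# Usage with dummy per-frame DINO features (e.g. pooled tokens):
enc, flow = RLAEncoder(), FlowHead()
z_t, z_next = torch.randn(8, 768), torch.randn(8, 768)
rla = enc(z_t, z_next).detach()  # RLA as target; joint training is also plausible
loss = flow_matching_loss(flow, rla, z_t)
loss.backward()
```

One design point this sketch illustrates: because the generative model operates over a compact RLA vector rather than the full feature map or pixel space, sampling stays cheap while still avoiding the mode-averaged, blurry predictions that direct feature regression produces.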