Action Images: End-to-End Policy Learning via Multiview Video Generation
April 7, 2026
Authors: Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
cs.AI
Abstract
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multiview action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
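To make the notion of a pixel-grounded action concrete, the sketch below shows the standard pinhole projection of a 3D end-effector waypoint into 2D pixel coordinates for a single camera view; repeating this per camera yields the multiview pixel traces an action-image representation relies on. This is an illustrative sketch, not the paper's actual pipeline, and the camera intrinsics/extrinsics here are hypothetical placeholders.

```python
import numpy as np

def project_to_pixels(point_3d, K, R, t):
    """Pinhole projection of a world-frame 3D point into one camera view.

    K: 3x3 intrinsics, R: 3x3 world-to-camera rotation, t: translation.
    """
    cam = R @ point_3d + t       # world frame -> camera frame
    uvw = K @ cam                # camera frame -> homogeneous pixel coords
    return uvw[:2] / uvw[2]      # perspective divide -> (u, v)

# Hypothetical 640x480 camera, 1 m from the origin along the optical axis.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 1.0])

waypoint = np.array([0.1, -0.05, 0.5])   # end-effector position in metres
u, v = project_to_pixels(waypoint, K, R, t)
print(u, v)
```

Projecting each waypoint of a 7-DoF trajectory this way, for every camera, produces per-view 2D tracks that can be rasterized into video frames; gripper state and rotation would need an additional visual encoding, which the sketch above does not cover.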