Action Images: End-to-End Policy Learning via Multiview Video Generation
April 7, 2026
Authors: Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan
cs.AI
Abstract
World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multiview action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
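The abstract does not specify how 7-DoF actions are rasterized into pixels. As a rough illustration only, the following is a minimal sketch, assuming a pinhole camera model with known per-view calibration, of how an end-effector position trajectory could be traced into a pixel-grounded "action image"; the function names, parameters, and time-as-intensity encoding here are hypothetical, not the paper's actual method.

```python
import numpy as np

def project_points(points_3d, K, world_to_cam):
    """Project Nx3 world-frame points to Nx2 pixel coordinates
    using a pinhole model (3x3 intrinsics K, 4x4 extrinsics)."""
    homog = np.concatenate([points_3d, np.ones((len(points_3d), 1))], axis=1)
    cam = (world_to_cam @ homog.T).T[:, :3]   # world frame -> camera frame
    uv = (K @ cam.T).T                        # camera frame -> image plane
    return uv[:, :2] / uv[:, 2:3]             # perspective divide

def render_action_image(traj_xyz, K, world_to_cam, hw=(256, 256)):
    """Rasterize an end-effector position trajectory (Tx3) into a
    single-channel canvas that explicitly tracks the arm's motion."""
    h, w = hw
    canvas = np.zeros((h, w), dtype=np.float32)
    pixels = project_points(traj_xyz, K, world_to_cam)
    for t, (u, v) in enumerate(pixels):
        u, v = int(round(u)), int(round(v))
        if 0 <= v < h and 0 <= u < w:
            # Encode time as intensity so the stroke is an ordered
            # track rather than an unordered set of points.
            canvas[v, u] = (t + 1) / len(pixels)
    return canvas
```

Rendering one such canvas per camera would yield a multiview representation; how the remaining degrees of freedom (orientation and gripper state) are encoded in pixels is a design choice the abstract leaves unspecified.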