

GigaWorld-Policy: An Efficient Action-Centered World-Action Model

March 18, 2026
Authors: Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, Min Cao, Peng Li, Qiuping Deng, Wenjun Mei, Xiaofeng Wang, Xinze Chen, Xinyu Zhou, Yang Wang, Yifan Chang, Yifan Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
cs.AI

Abstract

World-Action Models (WAMs) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making action prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training as two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
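The causal design described above can be pictured as a block-structured attention mask: action tokens attend to the observation and to earlier actions, while video tokens additionally attend to the predicted actions. The sketch below is purely illustrative (it is not the paper's implementation, and all token counts are hypothetical); it shows why the video block can be dropped at inference without changing the action outputs.

```python
import numpy as np

# Hypothetical token counts for one illustrative sequence:
# [observation tokens | action tokens | future-video tokens].
N_OBS, N_ACT, N_VID = 4, 3, 6
n = N_OBS + N_ACT + N_VID

# allow[i, j] == True means token i may attend to token j.
allow = np.zeros((n, n), dtype=bool)

obs = slice(0, N_OBS)
act = slice(N_OBS, N_OBS + N_ACT)
vid = slice(N_OBS + N_ACT, n)

# Observation tokens attend bidirectionally among themselves.
allow[obs, obs] = True
# Action tokens attend to the observation and causally to earlier actions,
# but never to any future-video token.
allow[act, obs] = True
allow[act, act] = np.tril(np.ones((N_ACT, N_ACT), dtype=bool))
# Video tokens condition on the observation, the predicted actions, and
# causally on earlier video tokens.
allow[vid, obs] = True
allow[vid, act] = True
allow[vid, vid] = np.tril(np.ones((N_VID, N_VID), dtype=bool))

# Key property: no action row attends to any video column, so the video
# block can be removed entirely for fast, action-only inference.
assert not allow[act, vid].any()
```

Because the action rows of the mask have no dependence on the video columns, deleting the video tokens leaves every action token's attention pattern, and hence its output, unchanged; this is what makes explicit video generation optional at deployment time.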
March 20, 2026