
GigaWorld-Policy: An Efficient Action-Centered World-Action Model

March 18, 2026
作者: Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, Min Cao, Peng Li, Qiuping Deng, Wenjun Mei, Xiaofeng Wang, Xinze Chen, Xinyu Zhou, Yang Wang, Yifan Chang, Yifan Li, Yukun Zhou, Yun Ye, Zhichao Liu, Zheng Zhu
cs.AI

Abstract
World-Action Models (WAMs) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training as two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
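The causal design described above — action tokens that never attend to future-video tokens, so the video branch can be skipped at deployment — can be illustrated with a block attention mask. This is a minimal sketch under assumed token ordering (observation, then action, then future-video tokens) and illustrative block sizes; the paper's exact layout and masking scheme may differ.

```python
import numpy as np

def build_mask(n_obs: int, n_act: int, n_vid: int) -> np.ndarray:
    """Boolean attention mask (True = attention allowed) for a token
    sequence laid out as [observation | action | future-video].
    Illustrative sketch; block sizes and ordering are assumptions."""
    n = n_obs + n_act + n_vid
    mask = np.zeros((n, n), dtype=bool)
    obs = slice(0, n_obs)
    act = slice(n_obs, n_obs + n_act)
    vid = slice(n_obs + n_act, n)
    mask[obs, obs] = True  # observation tokens attend among themselves
    mask[act, obs] = True  # action tokens condition on the observation
    # causal (autoregressive) attention within the action sequence
    mask[act, act] = np.tril(np.ones((n_act, n_act), dtype=bool))
    # video tokens may attend to everything: observation, actions, and
    # earlier video tokens (full attention kept simple here)
    mask[vid, :] = True
    return mask

mask = build_mask(n_obs=4, n_act=3, n_vid=5)
# No action row attends to any future-video column, so dropping the
# video tokens at inference leaves action predictions unchanged.
assert not mask[4:7, 7:].any()
```

Because every action row has all-False entries in the video columns, deleting the video block of the sequence (and the corresponding mask rows/columns) is a no-op for the action branch — which is exactly what makes video generation optional at inference time.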