GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Abstract

World-Action Models (WAMs) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and the corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training as two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, which provides richer learning signals and encourages physically plausible actions through visual-dynamics constraints. Because a causal design prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
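
The causal design described above can be illustrated with a block attention mask. The following is a minimal sketch under stated assumptions, not the authors' implementation: the token grouping (observation, action, future video), the function names `build_block_mask` and `masked_attention`, and the single-head attention without learned projections are all hypothetical choices made for illustration. It shows that when action tokens cannot attend to future-video tokens, the action outputs are unchanged if the video tokens are dropped entirely.

```python
import numpy as np

def build_block_mask(n_obs, n_act, n_vid):
    """Boolean mask; mask[q, k] is True when query token q may attend to key token k.

    Assumed layout (hypothetical): observation tokens, then action tokens,
    then future-video tokens.
    """
    # Group id per token: 0 = observation, 1 = action, 2 = future video.
    group = np.repeat([0, 1, 2], [n_obs, n_act, n_vid])
    # A query may attend to keys in its own group or in any earlier group,
    # so action queries never see future-video keys.
    return group[:, None] >= group[None, :]

def masked_attention(x, mask):
    """Single-head self-attention with no learned projections (illustration only)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax; every row has at least one allowed key (itself).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def action_outputs_match(n_obs=2, n_act=3, n_vid=4, dim=8, seed=0):
    """Compare action outputs computed with and without future-video tokens."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_obs + n_act + n_vid, dim))
    mask = build_block_mask(n_obs, n_act, n_vid)

    # Full sequence: video tokens are present but hidden from action queries.
    full = masked_attention(x, mask)[n_obs:n_obs + n_act]

    # Truncated sequence: video tokens are skipped, as at inference time.
    keep = n_obs + n_act
    short = masked_attention(x[:keep], mask[:keep, :keep])[n_obs:]

    return np.allclose(full, short)

if __name__ == "__main__":
    print(action_outputs_match())
```

In this sketch, video queries still attend to observation and action tokens, which mirrors the abstract's statement that future videos are generated conditioned on the predicted actions and the same observation.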

