GigaWorld-Policy: An Efficient Action-Centered World-Action Model

Abstract

World-Action Models (WAMs) initialized from pre-trained video generation backbones have shown strong potential for robot policy learning. However, existing approaches face two bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and the corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, so motion prediction accuracy depends heavily on the quality of the future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with video generation as an optional step. Specifically, we decompose policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, which provides richer learning signals and encourages physically plausible actions through visual-dynamics constraints. Because a causal design prevents future-video tokens from influencing action tokens, explicit future-video generation can be skipped at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.
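The causal design described in the abstract can be pictured as a block attention mask. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the token ordering (observation, then action, then future video), the group sizes, the full attention within each group, and the helper names `build_block_mask` and `action_submask` are all hypothetical. The one property taken from the abstract is that action tokens never attend to future-video tokens, while video tokens may attend to the observation and the actions.

```python
from __future__ import annotations

# Illustrative sketch only: token counts, ordering, and function names are
# assumptions made for this example, not details taken from the paper.


def build_block_mask(n_obs: int, n_act: int, n_vid: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token order: [observation | action | future video].
    """
    groups = ["obs"] * n_obs + ["act"] * n_act + ["vid"] * n_vid
    # Key groups visible to each query group. Action queries never see
    # video keys, so video tokens cannot influence action tokens.
    visible = {
        "obs": {"obs"},
        "act": {"obs", "act"},
        "vid": {"obs", "act", "vid"},
    }
    return [[k in visible[q] for k in groups] for q in groups]


def action_submask(mask: list[list[bool]], n_obs: int, n_act: int) -> list[list[bool]]:
    """Drop the video rows and columns, keeping what action-only inference needs."""
    n = n_obs + n_act
    return [row[:n] for row in mask[:n]]


if __name__ == "__main__":
    mask = build_block_mask(n_obs=2, n_act=2, n_vid=3)
    for row in mask:
        # "1" marks an allowed query/key pair, "." a blocked one.
        print("".join("1" if allowed else "." for allowed in row))
```

Under such a mask, no observation or action row has an allowed video column, so removing the video tokens leaves the action computation unchanged. That is why, in this sketch, the mask for action-only inference is the same as a mask built with zero video tokens.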
