GigaBrain-0: A World Model-Powered Vision-Language-Action Model
October 22, 2025
Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu
cs.AI
Abstract
Training Vision-Language-Action (VLA) models for generalist robots typically
requires large-scale real-world robot data, which is expensive and
time-consuming to collect. The inefficiency of physical data collection
severely limits the scalability and generalization capacity of current VLA
systems. To address this challenge, we introduce GigaBrain-0, a novel VLA
foundation model empowered by world model-generated data (e.g., video
generation, real2real transfer, human transfer, view transfer, sim2real
transfer data). By leveraging world models to generate diverse data at scale,
GigaBrain-0 significantly reduces reliance on real robot data while improving
cross-task generalization. Our approach further improves policy robustness
through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision,
enabling the model to reason about spatial geometry, object states, and
long-horizon dependencies during task execution. This leads to substantial
gains in real-world performance on dexterous, long-horizon, and mobile
manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves
superior generalization across variations in appearance (e.g., textures,
colors), object placements, and camera viewpoints. Additionally, we present
GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently
on devices such as the NVIDIA Jetson AGX Orin.