GigaBrain-0: A World Model-Powered Vision-Language-Action Model
October 22, 2025
Authors: GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye, Yukun Zhou, Zhehao Dong, Zhenan Wang, Zhichao Liu, Zheng Zhu
cs.AI
Abstract
Training Vision-Language-Action (VLA) models for generalist robots typically
requires large-scale real-world robot data, which is expensive and
time-consuming to collect. The inefficiency of physical data collection
severely limits the scalability and generalization capacity of current VLA
systems. To address this challenge, we introduce GigaBrain-0, a novel VLA
foundation model empowered by world model-generated data (e.g., video
generation, real2real transfer, human transfer, view transfer, sim2real
transfer data). By leveraging world models to generate diverse data at scale,
GigaBrain-0 significantly reduces reliance on real robot data while improving
cross-task generalization. Our approach further improves policy robustness
through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision,
enabling the model to reason about spatial geometry, object states, and
long-horizon dependencies during task execution. This leads to substantial
gains in real-world performance on dexterous, long-horizon, and mobile
manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves
superior generalization across variations in appearance (e.g., textures,
colors), object placements, and camera viewpoints. Additionally, we present
GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently
on devices such as the NVIDIA Jetson AGX Orin.