GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.

