GigaBrain-0: A World Model-Powered Vision-Language-Action Model

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, and sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearance (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
