Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
January 31, 2026
作者: I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov
cs.AI
Abstract
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
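To make the "unified, embodiment-aware action interface" concrete, below is a minimal Python sketch of one plausible design: each embodiment declares its degree-of-freedom layout, and a single policy emits actions in a fixed-width shared action space that is then masked and sliced per robot. This is not the authors' implementation; all names (`EmbodimentSpec`, `MAX_ACTION_DIM`, `decode_action`) and the specific DoF numbers are illustrative assumptions.

```python
# Hypothetical sketch of an embodiment-aware action interface: a single
# policy head outputs a fixed-width action vector; each embodiment spec
# defines which slice of that vector it consumes.
from dataclasses import dataclass
import numpy as np

MAX_ACTION_DIM = 32  # assumed shared action-space width across embodiments

@dataclass
class EmbodimentSpec:
    name: str
    arm_dof: int           # joints per arm
    num_arms: int          # 1 for a fixed-base arm, 2 for a humanoid
    has_mobile_base: bool  # adds (vx, vy, wz) base velocity channels

    @property
    def action_dim(self) -> int:
        grippers = self.num_arms            # one gripper channel per arm
        base = 3 if self.has_mobile_base else 0
        return self.arm_dof * self.num_arms + grippers + base

    def action_mask(self) -> np.ndarray:
        """Boolean mask selecting this embodiment's slice of the shared space."""
        mask = np.zeros(MAX_ACTION_DIM, dtype=bool)
        mask[: self.action_dim] = True
        return mask

def decode_action(shared_action: np.ndarray, spec: EmbodimentSpec) -> np.ndarray:
    """Slice a fixed-width policy output down to the embodiment's true DoF."""
    assert shared_action.shape == (MAX_ACTION_DIM,)
    return shared_action[spec.action_mask()]

if __name__ == "__main__":
    humanoid = EmbodimentSpec("green_humanoid", arm_dof=7, num_arms=2, has_mobile_base=True)
    widowx = EmbodimentSpec("widowx", arm_dof=6, num_arms=1, has_mobile_base=False)
    policy_out = np.random.randn(MAX_ACTION_DIM)  # stand-in for the VLA head output
    print(decode_action(policy_out, humanoid).shape)  # (19,)
    print(decode_action(policy_out, widowx).shape)    # (7,)
```

Under this kind of interface, cross-embodiment pretraining (stage R0) can train one policy on heterogeneous datasets, with per-embodiment adaptation (stage R1) only needing to refine the relevant slice of the shared action space.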