Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
January 31, 2026
作者: I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin, A. Letkin, E. Rusakov, A. Silchenko, V. Vorobyov, S. Sobolnikov, A. Postnikov
cs.AI
Abstract
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework designed for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (covering 3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface that enables a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and precise target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
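The abstract mentions a unified, embodiment-aware action interface that lets one policy drive humanoids, mobile manipulators, and fixed-base arms. A common way to realize such an interface is to pad each embodiment's native action vector to a shared width and carry a validity mask; the sketch below illustrates that idea only. All names, dimensions, and the padding scheme are assumptions for illustration, not details from the Green-VLA paper.

```python
from dataclasses import dataclass

# Assumed shared action-vector width across all embodiments (illustrative).
MAX_ACTION_DIM = 32

@dataclass
class EmbodimentSpec:
    """Hypothetical per-embodiment descriptor for a shared action space."""
    name: str
    action_dim: int  # native degrees of freedom plus gripper channels

    def pad_action(self, action: list) -> list:
        """Zero-pad a native action to the shared width."""
        assert len(action) == self.action_dim
        return list(action) + [0.0] * (MAX_ACTION_DIM - self.action_dim)

    def mask(self) -> list:
        """1.0 where this embodiment's action channels are valid, else 0.0."""
        return [1.0] * self.action_dim + [0.0] * (MAX_ACTION_DIM - self.action_dim)

# The three embodiment families named in the abstract; the dims are guesses.
humanoid = EmbodimentSpec("humanoid", action_dim=26)
mobile_manipulator = EmbodimentSpec("mobile_manipulator", action_dim=10)
fixed_base_arm = EmbodimentSpec("fixed_base_arm", action_dim=7)

# A single policy head of width MAX_ACTION_DIM can then serve every robot;
# the mask tells the loss and the controller which channels to ignore.
padded = fixed_base_arm.pad_action([0.1] * 7)
```

With this layout, training batches from different robots can be stacked into one tensor, and each embodiment's mask zeroes out the loss on channels it does not possess.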