MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
November 22, 2025
Authors: Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, Hao Tang
cs.AI
Abstract
Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) learning. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of embodied trajectories annotated with multi-granularity chain-of-thought (CoT) reasoning, providing structured supervision for reasoning-action alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with an improvement of approximately 5%. Real-world deployment on a quadruped robot further validates robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.
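For readers unfamiliar with the second training stage, the sketch below illustrates the core mechanics of GRPO as generally described in the literature: rewards from a group of rollouts sampled for the same instruction are normalized into group-relative advantages, then used in a PPO-style clipped objective with a KL penalty toward a frozen reference policy. This is a minimal, generic illustration, not the authors' implementation; the function names, hyperparameter values, and tensor shapes are assumptions for the example.

```python
# Minimal GRPO-style sketch (illustrative only; not the MobileVLA-R1 code).
# Assumes log-probabilities of sampled action/token sequences are available
# from the current policy, the behavior policy, and a frozen reference model.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Normalize rewards within each group of rollouts for the same instruction.

    rewards: shape (batch, group_size), one scalar reward per sampled rollout.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              logp_ref: torch.Tensor | None = None,
              clip_eps: float = 0.2,
              kl_coef: float = 0.04) -> torch.Tensor:
    """PPO-style clipped surrogate with group-relative advantages and an
    optional KL penalty against a frozen reference policy."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    loss = -torch.min(unclipped, clipped).mean()
    if logp_ref is not None:
        # Unbiased, non-negative KL estimator keeping the policy near the reference.
        kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0).mean()
        loss = loss + kl_coef * kl
    return loss
```

In this formulation, a reward model or task-specific reward (e.g., success or trajectory quality) only needs to score whole rollouts; the group-wise normalization supplies the baseline that a learned critic would otherwise provide.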