MobileVLA-R1: Reinforcing Vision-Language-Action for Mobile Robots
November 22, 2025
Authors: Ting Huang, Dongjian Li, Rui Yang, Zeyu Zhang, Zida Yang, Hao Tang
cs.AI
Abstract
Grounding natural-language instructions into continuous control for quadruped robots remains a fundamental challenge in vision-language-action (VLA) learning. Existing methods struggle to bridge high-level semantic reasoning and low-level actuation, leading to unstable grounding and weak generalization in the real world. To address these issues, we present MobileVLA-R1, a unified vision-language-action framework that enables explicit reasoning and continuous control for quadruped robots. We construct MobileVLA-CoT, a large-scale dataset of multi-granularity chain-of-thought (CoT) annotations for embodied trajectories, providing structured reasoning supervision for alignment. Built upon this foundation, we introduce a two-stage training paradigm that combines supervised CoT alignment with GRPO reinforcement learning to enhance reasoning consistency, control stability, and long-horizon execution. Extensive evaluations on VLN and VLA tasks demonstrate superior performance over strong baselines, with an improvement of approximately 5%. Real-world deployment on a quadruped robot confirms robust performance in complex environments. Code: https://github.com/AIGeeksGroup/MobileVLA-R1. Website: https://aigeeksgroup.github.io/MobileVLA-R1.
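
The second training stage relies on GRPO (Group Relative Policy Optimization), which samples a group of rollouts per instruction and normalizes each rollout's reward within the group, avoiding a separate learned critic. The sketch below illustrates that group-relative advantage and a PPO-style clipped surrogate loss under common GRPO conventions; it is not the authors' released implementation, and the function names (group_relative_advantages, grpo_policy_loss) are hypothetical.

```python
# Minimal sketch of the group-relative advantage at the core of GRPO,
# assuming PyTorch. Names and defaults are illustrative, not the paper's API.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (G,), rewards of G rollouts sampled for the same instruction.
    Each advantage is the reward standardized within its group, which replaces
    the value/critic baseline used in standard PPO."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     advantages: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate objective with group-relative advantages.
    All tensors are per-rollout, shape (G,)."""
    ratio = torch.exp(logp_new - logp_old)          # importance ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # maximize surrogate -> minimize negative
```

In this setup the reward could combine reasoning-consistency and control-quality terms as described in the abstract; the group normalization keeps the update scale comparable across instructions of very different difficulty.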