Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
October 13, 2025
Authors: Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou
cs.AI
Abstract
While significant research has focused on developing embodied reasoning
capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs
into Vision-Language-Action (VLA) models for end-to-end robot control, few
studies directly address the critical gap between upstream VLM-based reasoning
and downstream VLA policy learning. In this work, we take an initial step
toward bridging embodied reasoning with VLA policy learning by introducing
Vlaser, a Vision-Language-Action model with synergistic embodied reasoning
capability: a foundational vision-language model designed to integrate
high-level reasoning with low-level control for embodied agents. Built upon the
high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance
across a range of embodied reasoning benchmarks, including spatial reasoning,
embodied grounding, embodied QA, and task planning. Furthermore, we
systematically examine how different VLM initializations affect supervised VLA
fine-tuning, offering novel insights into mitigating the domain shift between
internet-scale pre-training data and embodied-specific policy learning data.
Based on these insights, our approach achieves state-of-the-art results on the
WidowX benchmark and competitive performance on the Google Robot benchmark.