Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
October 13, 2025
Authors: Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou
cs.AI
Abstract
While significant research has focused on developing embodied reasoning
capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs
into Vision-Language-Action (VLA) models for end-to-end robot control, few
studies directly address the critical gap between upstream VLM-based reasoning
and downstream VLA policy learning. In this work, we take an initial step
toward bridging embodied reasoning with VLA policy learning by introducing
Vlaser, a Vision-Language-Action model with synergistic embodied reasoning
capability: a foundational vision-language model designed to integrate
high-level reasoning with low-level control for embodied agents. Built upon the
high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance
across a range of embodied reasoning benchmarks, including spatial reasoning,
embodied grounding, embodied QA, and task planning. Furthermore, we
systematically examine how different VLM initializations affect supervised VLA
fine-tuning, offering novel insights into mitigating the domain shift between
internet-scale pre-training data and embodied-specific policy learning data.
Based on these insights, our approach achieves state-of-the-art results on the
WidowX benchmark and competitive performance on the Google Robot benchmark.