Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
October 13, 2025
作者: Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou
cs.AI
Abstract
While significant research has focused on developing embodied reasoning
capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs
into Vision-Language-Action (VLA) models for end-to-end robot control, few
studies directly address the critical gap between upstream VLM-based reasoning
and downstream VLA policy learning. In this work, we take an initial step
toward bridging embodied reasoning with VLA policy learning by introducing
Vlaser - a Vision-Language-Action Model with synergistic embodied reasoning
capability, which is a foundational vision-language model designed to integrate
high-level reasoning with low-level control for embodied agents. Built upon the
high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance
across a range of embodied reasoning benchmarks - including spatial reasoning,
embodied grounding, embodied QA, and task planning. Furthermore, we
systematically examine how different VLM initializations affect supervised VLA
fine-tuning, offering novel insights into mitigating the domain shift between
internet-scale pre-training data and embodied-specific policy learning data.
Based on these insights, our approach achieves state-of-the-art results on the
WidowX benchmark and competitive performance on the Google Robot benchmark.
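The abstract refers to initializing supervised VLA fine-tuning from different VLM checkpoints. The sketch below is a minimal, hypothetical illustration of that setup, not the authors' code: a toy stand-in backbone plays the role of the pretrained VLM, a lightweight action head is added on top, and the combined policy is fine-tuned with behavior cloning on (image, instruction, action) tuples. All module names, dimensions, and the random stand-in data are assumptions made for illustration.

```python
# Minimal sketch of VLM-initialized supervised VLA fine-tuning (illustrative only).
import torch
import torch.nn as nn

class ToyVLMBackbone(nn.Module):
    """Hypothetical stand-in for a pretrained vision-language model checkpoint."""
    def __init__(self, embed_dim=256):
        super().__init__()
        # Toy vision and language encoders; a real VLM would be loaded from pre-training.
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, embed_dim))
        self.language = nn.Embedding(1000, embed_dim)

    def forward(self, image, instruction_tokens):
        img_feat = self.vision(image)                          # (B, D)
        txt_feat = self.language(instruction_tokens).mean(1)   # (B, D)
        return img_feat + txt_feat                             # fused embedding

class VLAPolicy(nn.Module):
    """VLA policy: VLM-initialized backbone plus an action head trained from scratch."""
    def __init__(self, backbone, embed_dim=256, action_dim=7):
        super().__init__()
        self.backbone = backbone
        self.action_head = nn.Linear(embed_dim, action_dim)

    def forward(self, image, instruction_tokens):
        return self.action_head(self.backbone(image, instruction_tokens))

# Supervised fine-tuning loop with random tensors in place of robot demonstrations.
policy = VLAPolicy(ToyVLMBackbone())
optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

for step in range(10):
    image = torch.randn(8, 3, 224, 224)
    instruction = torch.randint(0, 1000, (8, 16))
    expert_action = torch.randn(8, 7)
    pred_action = policy(image, instruction)
    loss = nn.functional.mse_loss(pred_action, expert_action)  # behavior cloning loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this framing, the choice of which pretrained checkpoint populates the backbone is exactly the "VLM initialization" the abstract examines; the fine-tuning procedure itself stays the same across initializations.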