
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

April 21, 2026
作者: Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, Chenyan Xiong
cs.AI

Abstract

Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. We then build a mid-training data engine: a lightweight learnable proximity estimator selects the most VLA-aligned candidates from a large VLM pool, and the VLM is mid-trained on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and with off-the-shelf VLMs trained at larger model scales and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data, and models for future research.
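
As a rough illustration of the data engine described in the abstract, the sketch below trains a small binary classifier as a "proximity estimator" over frozen embeddings (VLA samples vs. the broad VLM pool) and keeps the highest-scoring pool samples for mid-training. This is a minimal sketch under stated assumptions: the class names, embedding dimensions, classification objective, and top-k selection rule are all hypothetical, since the abstract does not specify the estimator's implementation.

```python
# Hypothetical sketch of a "proximity estimator" data engine: a small MLP
# is trained to separate VLA-style embeddings (label 1) from VLM-pool
# embeddings (label 0), then its scores rank the pool and the top fraction
# is kept as the mid-training mixture. Not the paper's actual code.

import torch
import torch.nn as nn


class ProximityEstimator(nn.Module):
    """Lightweight MLP scoring how VLA-aligned an embedded sample is."""

    def __init__(self, embed_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit: higher = closer to VLA data
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)


def select_vla_aligned(vla_emb, vlm_emb, keep_frac=0.1, epochs=5, lr=1e-3):
    """Train the estimator to separate VLA from VLM-pool embeddings,
    then return indices of the highest-scoring pool samples."""
    model = ProximityEstimator(vla_emb.shape[-1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    x = torch.cat([vla_emb, vlm_emb])
    y = torch.cat([torch.ones(len(vla_emb)), torch.zeros(len(vlm_emb))])
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    with torch.no_grad():
        scores = model(vlm_emb)  # proximity of each pool sample to VLA data
    k = max(1, int(keep_frac * len(vlm_emb)))
    return torch.topk(scores, k).indices


# Toy usage with random vectors standing in for frozen-encoder features:
# the VLA cluster is compact and offset, mimicking the distribution gap
# the abstract describes.
if __name__ == "__main__":
    vla = torch.randn(512, 768) + 2.0   # compact, separated VLA-like cluster
    pool = torch.randn(4096, 768)       # broad VLM pool
    idx = select_vla_aligned(vla, pool)
    print(f"Selected {len(idx)} of {len(pool)} pool samples for mid-training")
```

One plausible design choice reflected here is that the estimator stays cheap (a two-layer MLP over precomputed embeddings), so it can score a large VLM pool at both the dataset and sample level without retraining the backbone.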