

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

April 21, 2026
作者: Yiyang Du, Zhanqiu Guo, Xin Ye, Liu Ren, Chenyan Xiong
cs.AI

Abstract

Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, limiting their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. Then, we build a mid-training data engine that leverages a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and mid-trains the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained at larger model scales and with larger training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data, and models for future research.
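
The abstract describes the data engine only at a high level. Below is a minimal, illustrative sketch (in PyTorch) of how a lightweight learnable proximity estimator might score a large VLM pool and keep the most VLA-aligned candidates for the mid-training mixture. The class names, network size, binary labeling scheme, and top-k selection shown here are assumptions for illustration; the paper's actual features, estimator architecture, and selection criteria are not given in this abstract.

```python
# Illustrative sketch only: a small learnable "proximity estimator" trained to
# separate VLA-domain embeddings from general VLM-pool embeddings, then used to
# rank the VLM pool. All names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn as nn


class ProximityEstimator(nn.Module):
    """Lightweight MLP that scores how VLA-aligned a sample's embedding is."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One logit per sample; higher means closer to the VLA data distribution.
        return self.net(x).squeeze(-1)


def train_estimator(vla_emb: torch.Tensor, vlm_emb: torch.Tensor,
                    dim: int, epochs: int = 5, lr: float = 1e-3) -> ProximityEstimator:
    """Train the estimator to distinguish VLA-domain embeddings (label 1)
    from general VLM-pool embeddings (label 0)."""
    model = ProximityEstimator(dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    x = torch.cat([vla_emb, vlm_emb])
    y = torch.cat([torch.ones(len(vla_emb)), torch.zeros(len(vlm_emb))])
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model


def select_top_k(model: ProximityEstimator, pool_emb: torch.Tensor, k: int) -> torch.Tensor:
    """Score the VLM pool and return indices of the k most VLA-aligned samples."""
    with torch.no_grad():
        scores = model(pool_emb)
    return torch.topk(scores, k).indices


if __name__ == "__main__":
    dim = 512  # assumed embedding dimension, e.g. from a frozen encoder
    vla_emb = torch.randn(100, dim) + 2.0   # stand-in for VLA-domain embeddings
    vlm_emb = torch.randn(1000, dim)        # stand-in for the general VLM pool
    est = train_estimator(vla_emb, vlm_emb, dim)
    keep = select_top_k(est, vlm_emb, k=200)
    print(f"Selected {len(keep)} VLA-aligned candidates for the mid-training mixture")
```

In this sketch the selected subset would then be mixed into the data used to mid-train the VLM before downstream VLA fine-tuning, mirroring the pipeline the abstract outlines.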