EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

Abstract

Vision-Language-Action Models (VLAs) inherit their visual and linguistic capabilities from Vision-Language Models (VLMs), yet most VLAs are built from off-the-shelf VLMs that are not adapted to the embodied domain, which limits their downstream performance. In this work, we propose EmbodiedMidtrain to bridge the gap between VLMs and VLAs. We first characterize the data distribution gap between them, showing that VLA data occupy compact regions that are largely separated from the broader VLM distribution, while the degree of alignment varies substantially both across and within VLM data sources. We then build a mid-training data engine that uses a lightweight learnable proximity estimator to select the most VLA-aligned candidates from a large VLM pool, and we mid-train the VLM on this curated mixture before downstream VLA fine-tuning. Experiments on three robot manipulation benchmarks show that mid-training consistently improves performance across different VLM backbones, achieving results competitive with expert VLAs and off-the-shelf VLMs trained with larger model scale and training budgets. Further analysis reveals that mid-training provides a stronger initialization for VLA fine-tuning, with gains emerging from the earliest steps and widening throughout training. Moreover, the data engine captures both dataset-level and sample-level alignment signals, favoring spatial reasoning over text-centric tasks while preserving the diversity of the VLM data. We will release all code, data, and models for future research.
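The abstract does not specify the form of the proximity estimator, so the following is only a minimal sketch of one plausible reading of the selection step. It assumes (these are assumptions, not details from the paper) that every sample is already represented by a fixed-length feature vector, that the estimator is a logistic-regression classifier trained to separate VLA samples from VLM samples, and that the predicted probability of "VLA" serves as the proximity score. The function names, the toy 2-D Gaussian features, and the hyperparameters are all hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_proximity_estimator(vla_feats, vlm_feats, epochs=200, lr=0.5):
    """Fit logistic regression by batch gradient descent.

    Label 1 = VLA sample, label 0 = VLM sample (an assumed training setup).
    """
    data = [(x, 1.0) for x in vla_feats] + [(x, 0.0) for x in vlm_feats]
    dim = len(data[0][0])
    n = len(data)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * gi / n for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def proximity(w, b, x):
    # Higher score = the sample looks more like VLA data to the estimator.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def select_top_k(w, b, vlm_pool, k):
    # Rank the VLM pool by proximity score and keep the k best indices.
    ranked = sorted(
        range(len(vlm_pool)),
        key=lambda i: proximity(w, b, vlm_pool[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-in data: VLA features form a compact cluster, while VLM
# features are spread more broadly (mirroring the abstract's description).
rng = random.Random(0)
vla_feats = [[rng.gauss(2.0, 0.3), rng.gauss(2.0, 0.3)] for _ in range(50)]
vlm_feats = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]

w, b = train_proximity_estimator(vla_feats, vlm_feats)
selected = select_top_k(w, b, vlm_feats, k=20)
print(f"selected {len(selected)} of {len(vlm_feats)} VLM samples")
```

In this sketch the selected indices would define the curated mid-training mixture. A pure top-k cut is the simplest choice; the abstract notes that the actual engine preserves the diversity of the VLM data, which a hard cutoff like this one does not by itself guarantee.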
