Rethinking VLM Representations for VLA Initialization

Abstract

Vision-Language-Action (VLA) models widely adopt pretrained Vision-Language Models (VLMs) as policy backbones, yet it remains unclear which kinds of pretrained VLM representations make a useful VLA initialization. In this paper, we study VLA initialization as a controlled representation-design problem along three axes: capability-level embodied VQA supervision, parameter-update strategy, and robot-data pretraining. Our experiments show that the original pretrained VLM representation is a key source of action performance. Embodied VQA adaptation, however, does not yield uniform gains: its benefit depends on the downstream bottleneck, and gains from different capability domains are not simply additive. For the update strategy, LoRA provides a more reliable initialization than full fine-tuning, indicating that overly reshaping the pretrained representation can weaken VLA initialization. Robot-data pretraining further improves VLA initialization, with the strongest variant obtained through staged LoRA-based training. Together, these findings suggest that effective VLM-to-VLA adaptation should inject action-relevant embodied and robot-trajectory signals while preserving the pretrained VLM representations that remain useful for action learning.