APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs, on top of a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including π-style and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
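
The Bayesian factorization described in the abstract can be written out explicitly with Bayes' rule. The following is only a sketch: the symbols are assumptions introduced here for illustration and do not appear in the abstract, with $a$ denoting the action, $v$ the visual observation, and $\ell$ the language instruction.

```latex
% Assumed notation (not from the abstract):
%   a = action, v = visual observation, \ell = language instruction.
% Requires amsmath for \text.
\[
\pi(a \mid v, \ell)
  = \frac{p(\ell \mid a, v)\, p(a \mid v)}{p(\ell \mid v)}
  \;\propto\;
  \underbrace{p(a \mid v)}_{\text{VA prior (language-agnostic)}}
  \cdot
  \underbrace{p(\ell \mid a, v)}_{\text{VLA likelihood (language-conditioned)}}
\]
```

The normalizer $p(\ell \mid v)$ does not depend on $a$, which is why it can be dropped. Under this reading, Stage 1 would fit the prior term from vision-action pairs alone, and Stage 2 would add the language-dependent term on top of it. How the method realizes each term in practice is not specified in the abstract.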