APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

June 10, 2026
Authors: Kechun Xu, Zhenjie Zhu, Anzhe Chen, Rong Xiong, Yue Wang
cs.AI

Abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and fail to exploit its language capability. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs, on top of a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including π-style and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
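
To make the Stage 2 idea concrete, here is a minimal sketch of one way a gated fusion could preserve a pretrained visuomotor prior. The abstract does not specify the gate's form, so everything below is an assumption for illustration, not the authors' implementation: the function name `gated_fusion`, the scalar `gate_param`, the additive residual form, and the zero-initialized tanh gate are all hypothetical. With the gate at zero, the fused output equals the vision-action features exactly, so language is blended in only as the gate opens during training.

```python
import math


def gated_fusion(va_features, lang_features, gate_param):
    """Blend language-conditioned features into visuomotor features.

    Hypothetical form (an assumption, not taken from the paper):
        out = va + tanh(gate_param) * lang
    With gate_param = 0.0 the gate is closed and the output equals
    va_features, leaving a pretrained vision-action prior untouched.
    """
    if len(va_features) != len(lang_features):
        raise ValueError("feature vectors must have the same length")
    gate = math.tanh(gate_param)
    return [v + gate * l for v, l in zip(va_features, lang_features)]


# Toy stand-ins for real features (illustrative values only).
va = [0.5, -1.0, 2.0]     # action-expert features after Stage 1
lang = [1.0, 1.0, -1.0]   # language-conditioned VLM features

start = gated_fusion(va, lang, gate_param=0.0)  # gate closed: equals va
later = gated_fusion(va, lang, gate_param=1.0)  # gate partly open
```

In a real model the gate would be a learned parameter, possibly per layer or per channel, and the features would be tensors rather than Python lists. The scalar version only shows why starting with a closed gate leaves the Stage 1 behavior intact at the beginning of Stage 2.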