APT: Action Expert Pretraining Improves Instruction Generalization of Vision-Language-Action Policies

Abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts achieve strong manipulation performance, yet they generalize poorly to out-of-distribution (OOD) language instructions. A known challenge is the structural imbalance in VLA data: language is far less diverse than visual and action content, which makes policies prone to visual shortcuts. Discrete-action methods mitigate this through vision-language co-training, but continuous action experts lack such protection. They start from random initialization and learn entirely from imbalanced data, producing noisy gradients that corrupt the VLM and leave its language capability unexploited. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method that emphasizes Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs, using features from a frozen VLM, which bypasses the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including π-style and GR00T-style architectures. Comprehensive experiments show that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
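
The abstract does not specify the form of the gated fusion used in Stage 2. As a rough illustration only, the sketch below assumes one common design: a residual connection scaled by a tanh gate whose parameter starts at zero. The function `gated_fusion`, the parameter `gate_param`, and the toy feature vectors are all hypothetical and are not taken from the paper.

```python
import math

def gated_fusion(action_feats, language_feats, gate_param):
    """Hypothetical gated residual fusion (an assumption, not the paper's method).

    Computes action + tanh(gate_param) * language element-wise.
    With gate_param = 0 the gate is closed and the output equals action_feats.
    """
    gate = math.tanh(gate_param)
    return [a + gate * l for a, l in zip(action_feats, language_feats)]

# Toy stand-ins for Stage 1 action-expert features and projected language features.
action_feats = [0.5, -1.0, 2.0]
language_feats = [1.0, 1.0, -3.0]

# Gate closed (zero initialization): the pretrained features pass through unchanged.
closed = gated_fusion(action_feats, language_feats, gate_param=0.0)

# Gate partially open: language features now shift the output.
opened = gated_fusion(action_feats, language_feats, gate_param=1.0)

print(closed)
print(opened)
```

Under this assumed design, the start of Stage 2 reproduces the Stage 1 behavior exactly, and language influence grows only as training moves the gate away from zero. That is one way a fusion layer could preserve a learned visuomotor prior while integrating VLM features.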