APT: Improving Instruction Generalization of Vision-Language-Action Policies via Action Expert Pretraining

Abstract

Vision-Language-Action (VLA) models that couple pretrained Vision-Language Models (VLMs) with continuous action experts have achieved strong manipulation performance, yet generalization to out-of-distribution (OOD) language instructions remains poor. A known challenge is the structural imbalance in VLA data, where language is far less diverse than visual and action content, making policies prone to visual shortcuts. While discrete-action methods mitigate this through vision-language co-training, continuous action experts lack such protection: they start from random initialization and learn entirely from imbalanced data, so their noisy gradients corrupt the VLM and leave its language capability unexploited. We address this from a Bayesian perspective, factorizing the policy into a language-agnostic Vision-Action (VA) prior and a language-conditioned VLA likelihood, and propose APT, a two-stage training method emphasizing Action expert PreTraining. In Stage 1, the action expert is pretrained as a VA prior on vision-action pairs from a frozen VLM, bypassing the language imbalance. In Stage 2, language tokens are injected through a gated fusion mechanism that integrates VLM features while preserving the learned visuomotor prior. APT applies to mainstream VLA architectures, including π-style and GR00T-style architectures. Comprehensive experiments validate that APT achieves consistent gains on unseen instructions and compositional tasks. Project Page: https://xukechun.github.io/papers/APT/
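
The abstract does not give the form of the Stage 2 gate, so the following is only a minimal sketch of how a gated fusion can preserve a pretrained prior. It assumes a single scalar gate passed through tanh and initialized at zero, which is a common design choice and not confirmed as APT's. The function name `gated_fusion` and the parameter `gate_param` are hypothetical, and plain Python lists stand in for feature tensors.

```python
import math
from typing import List


def gated_fusion(va_features: List[float],
                 lang_features: List[float],
                 gate_param: float) -> List[float]:
    """Add language features to vision-action features through a scalar tanh gate.

    Hypothetical illustration: the real mechanism in APT may differ.
    """
    if len(va_features) != len(lang_features):
        raise ValueError("feature vectors must have the same length")
    # tanh(0.0) == 0.0, so a zero-initialized gate passes the VA features through unchanged.
    gate = math.tanh(gate_param)
    return [va + gate * lang for va, lang in zip(va_features, lang_features)]


# Toy features standing in for the action expert's vision-action stream
# and for language features taken from the VLM (assumed shapes).
va = [0.5, -1.0, 2.0]
lang = [1.0, 1.0, 1.0]

closed = gated_fusion(va, lang, 0.0)  # gate closed: output equals the VA features
opened = gated_fusion(va, lang, 1.0)  # gate partly open: language shifts the features
```

With the gate closed, the fused output is exactly the Stage 1 vision-action features, so under this assumption Stage 2 training would start from the pretrained prior and admit language only as the gate parameter moves away from zero.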