BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
January 21, 2026
作者: Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen
cs.AI
Abstract
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior p(a | v) and a language-conditioned posterior π(a | v, ℓ). We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
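As a minimal sketch of the PMI objective described above (an illustration based on the abstract, not the authors' released implementation), the contrast between the two branches is log π(a | v, ℓ) − log p(a | v). The PyTorch-style function below assumes each branch exposes a per-sample log-density of the demonstrated action; the names `pmi_objective`, `posterior_logp`, and `prior_logp` are hypothetical.

```python
import torch


def pmi_objective(posterior_logp: torch.Tensor, prior_logp: torch.Tensor) -> torch.Tensor:
    """Negative conditional pointwise mutual information of demonstrated actions.

    posterior_logp: per-sample log pi(a | v, l) from the language-conditioned branch.
    prior_logp:     per-sample log p(a | v) from the vision-only branch.
    Both tensors score the same demonstrated actions a (hypothetical interface).
    """
    # PMI(a; l | v) = log pi(a | v, l) - log p(a | v).
    # Returning the negative mean gives a loss whose minimization rewards actions
    # that the instruction explains beyond what vision alone already predicts.
    return -(posterior_logp - prior_logp).mean()
```

Under this reading, adding such a term to the usual imitation loss would penalize the vision shortcut: actions that the vision-only prior already predicts well contribute little PMI, so the policy gains only by attending to the instruction.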