BayesianVLA: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries
January 21, 2026
Authors: Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen
cs.AI
Abstract
Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose BayesianVLA, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior p(a | v) and a language-conditioned posterior π(a | v, ℓ). We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, BayesianVLA significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
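To make the objective concrete, below is a minimal PyTorch-style sketch of the quantity the abstract says BayesianVLA maximizes: the conditional pointwise mutual information PMI(a; ℓ | v), computed from the vision-only prior branch and the language-conditioned posterior branch. The tensor names, the assumption of discretized action tokens, and the function signature are illustrative; how this score is folded into the full training loss is specified in the paper, not in the abstract, so the sketch only shows the score itself.

```python
# Minimal sketch (assumed PyTorch dual-branch setup, not the paper's released code).
import torch
import torch.nn.functional as F


def conditional_pmi(posterior_logits: torch.Tensor,
                    prior_logits: torch.Tensor,
                    action_tokens: torch.Tensor) -> torch.Tensor:
    """PMI(a; l | v) = log pi(a | v, l) - log p(a | v) for discretized actions.

    posterior_logits: language-conditioned branch, shape (batch, seq, action_vocab)
    prior_logits:     vision-only branch,          shape (batch, seq, action_vocab)
    action_tokens:    executed action token ids,   shape (batch, seq)
    """
    log_post = F.log_softmax(posterior_logits, dim=-1)
    log_prior = F.log_softmax(prior_logits, dim=-1)

    # Gather the log-probability of the taken action under each branch.
    idx = action_tokens.unsqueeze(-1)
    log_post_a = log_post.gather(-1, idx).squeeze(-1)    # log pi(a | v, l)
    log_prior_a = log_prior.gather(-1, idx).squeeze(-1)  # log p(a | v)

    # Large values mean the action is explained by the instruction rather than
    # by the visual context alone; values near zero indicate the vision shortcut
    # (the Information Collapse failure mode described in the abstract).
    return log_post_a - log_prior_a
```

Maximizing this score during policy optimization, per the abstract, penalizes actions already predictable from vision alone and rewards actions that explicitly account for the language command.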