Verifier-free Test-Time Sampling for Vision Language Action Models
October 7, 2025
Authors: Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
cs.AI
Abstract
Vision-Language-Action models (VLAs) have demonstrated remarkable performance
in robot control. However, they remain fundamentally limited in tasks that
require high precision due to their single-inference paradigm. While test-time
scaling approaches using external verifiers have shown promise, they require
additional training and fail to generalize to unseen conditions. We propose
Masking Distribution Guided Selection (MG-Select), a novel test-time scaling
framework for VLAs that leverages the model's internal properties without
requiring additional training or external modules. Our approach utilizes KL
divergence from a reference action token distribution as a confidence metric
for selecting the optimal action from multiple candidates. We introduce a
reference distribution generated by the same VLA but with randomly masked
states and language conditions as inputs, ensuring maximum uncertainty while
remaining aligned with the target task distribution. Additionally, we propose a
joint training strategy that enables the model to learn both conditional and
unconditional distributions by applying dropout to state and language
conditions, thereby further improving the quality of the reference
distribution. Our experiments demonstrate that MG-Select achieves significant
performance improvements, including a 28%/35% improvement in real-world
in-distribution/out-of-distribution tasks, along with a 168% relative gain on
RoboCasa pick-and-place tasks trained with 30 demonstrations.
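The core selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each candidate action is scored by its KL divergence from the near-uniform reference distribution (produced with masked state/language conditions), and that a larger divergence from that maximally uncertain reference signals a more confident candidate. The function names `kl_divergence` and `mg_select` are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete action-token distributions;
    # eps clipping avoids log(0) for zero-probability tokens.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def mg_select(candidate_dists, reference_dist):
    # Hypothetical sketch of the MG-Select criterion: pick the candidate
    # whose token distribution diverges most from the masked-condition
    # reference (assumed here to indicate the most confident prediction).
    scores = [kl_divergence(p, reference_dist) for p in candidate_dists]
    return int(np.argmax(scores))
```

In this reading, a sharply peaked candidate distribution scores higher than a diffuse one, because the reference (generated with conditions masked out) is close to uniform.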