Verifier-free Test-Time Sampling for Vision Language Action Models
October 7, 2025
Authors: Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
cs.AI
Abstract
Vision-Language-Action models (VLAs) have demonstrated remarkable performance
in robot control. However, they remain fundamentally limited in tasks that
require high precision due to their single-inference paradigm. While test-time
scaling approaches using external verifiers have shown promise, they require
additional training and fail to generalize to unseen conditions. We propose
Masking Distribution Guided Selection (MG-Select), a novel test-time scaling
framework for VLAs that leverages the model's internal properties without
requiring additional training or external modules. Our approach utilizes KL
divergence from a reference action token distribution as a confidence metric
for selecting the optimal action from multiple candidates. We introduce a
reference distribution generated by the same VLA but with randomly masked
states and language conditions as inputs, ensuring maximum uncertainty while
remaining aligned with the target task distribution. Additionally, we propose a
joint training strategy that enables the model to learn both conditional and
unconditional distributions by applying dropout to state and language
conditions, thereby further improving the quality of the reference
distribution. Our experiments demonstrate that MG-Select achieves significant
performance improvements, including a 28%/35% improvement in real-world
in-distribution/out-of-distribution tasks, along with a 168% relative gain on
RoboCasa pick-and-place tasks trained with 30 demonstrations.
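The core selection step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each candidate action is scored by its KL divergence from the near-uniform reference distribution (produced with masked state/language conditions), and that a larger divergence from that maximally uncertain reference signals a more confident candidate. The function names `kl_divergence` and `mg_select` are hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete action-token distributions;
    # eps clipping avoids log(0) for zero-probability tokens.
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def mg_select(candidate_dists, reference_dist):
    # Hypothetical sketch of the MG-Select criterion: pick the candidate
    # whose token distribution diverges most from the masked-condition
    # reference (assumed here to indicate the most confident prediction).
    scores = [kl_divergence(p, reference_dist) for p in candidate_dists]
    return int(np.argmax(scores))
```

In this reading, a sharply peaked candidate distribution scores higher than a diffuse one, because the reference (generated with conditions masked out) is close to uniform.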