Verifier-free Test-Time Sampling for Vision Language Action Models
October 7, 2025
Authors: Suhyeok Jang, Dongyoung Kim, Changyeon Kim, Youngsuk Kim, Jinwoo Shin
cs.AI
Abstract
Vision-Language-Action models (VLAs) have demonstrated remarkable performance
in robot control. However, they remain fundamentally limited in tasks that
require high precision due to their single-inference paradigm. While test-time
scaling approaches using external verifiers have shown promise, they require
additional training and fail to generalize to unseen conditions. We propose
Masking Distribution Guided Selection (MG-Select), a novel test-time scaling
framework for VLAs that leverages the model's internal properties without
requiring additional training or external modules. Our approach utilizes KL
divergence from a reference action token distribution as a confidence metric
for selecting the optimal action from multiple candidates. We introduce a
reference distribution generated by the same VLA but with randomly masked
states and language conditions as inputs, ensuring maximum uncertainty while
remaining aligned with the target task distribution. Additionally, we propose a
joint training strategy that enables the model to learn both conditional and
unconditional distributions by applying dropout to state and language
conditions, thereby further improving the quality of the reference
distribution. Our experiments demonstrate that MG-Select achieves significant
performance improvements, including a 28%/35% improvement in real-world
in-distribution/out-of-distribution tasks, along with a 168% relative gain on
RoboCasa pick-and-place tasks trained with 30 demonstrations.
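The selection rule described above can be sketched in a few lines: each candidate action's token distribution is scored by its KL divergence from the max-uncertainty reference distribution, and the candidate with the highest divergence (i.e., highest confidence) is chosen. This is a minimal illustrative sketch, not the authors' implementation; the function names and the use of plain probability lists are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    z_p, z_q = sum(p), sum(q)  # normalize defensively
    return sum(
        (pi / z_p) * math.log((pi / z_p + eps) / (qi / z_q + eps))
        for pi, qi in zip(p, q)
    )

def mg_select(candidate_dists, reference_dist):
    """Hypothetical MG-Select-style rule: return the index of the candidate
    action whose token distribution is farthest (highest KL) from the
    max-uncertainty reference distribution, plus all scores."""
    scores = [kl_divergence(c, reference_dist) for c in candidate_dists]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Intuitively, a near-uniform reference (produced here by masking the state and language conditions) scores sharp, confident candidate distributions highly, while uncertain candidates stay close to it and score near zero.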