Verifier-free Test-Time Sampling for Vision Language Action Models

Abstract

Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, because they rely on a single-inference paradigm, they remain fundamentally limited in tasks that require high precision. Test-time scaling approaches that use external verifiers have shown promise, but they require additional training and fail to generalize to unseen conditions. We propose Masking Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the model's internal properties and requires neither additional training nor external modules. Our approach uses the KL divergence from a reference action token distribution as a confidence metric for selecting the optimal action from multiple candidates. The reference distribution is generated by the same VLA, but with randomly masked states and language conditions as inputs, so that it carries maximum uncertainty while remaining aligned with the target task distribution. Additionally, we propose a joint training strategy that applies dropout to the state and language conditions, enabling the model to learn both conditional and unconditional distributions and thereby further improving the quality of the reference distribution. Our experiments demonstrate that MG-Select achieves significant performance improvements, including 28% and 35% improvements on real-world in-distribution and out-of-distribution tasks, respectively, along with a 168% relative gain on RoboCasa pick-and-place tasks trained with 30 demonstrations.
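
The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the direction of the KL divergence, how per-token values are aggregated, or whether the highest score wins. The sketch below assumes KL(conditioned || reference), summed over action tokens, with the largest value treated as the most confident candidate. The names `kl_divergence`, `candidate_confidence`, and `select_action`, and the toy probabilities, are all hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # eps guards against log(0) and division by zero; terms with p_i == 0 contribute 0.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def candidate_confidence(cond_dists, ref_dists):
    """Sum the per-token KL between conditioned and reference distributions.

    Assumption: summation over tokens; the abstract does not state the aggregation.
    """
    return sum(kl_divergence(p, q) for p, q in zip(cond_dists, ref_dists))


def select_action(candidates):
    """Return (index of the highest-scoring candidate, list of all scores).

    Each candidate is a (cond_dists, ref_dists) pair:
      cond_dists: per-token distributions from the VLA with full inputs
      ref_dists:  per-token distributions from the same VLA with masked
                  state and language inputs (the reference distribution)
    Assumption: a larger divergence from the high-uncertainty reference
    is read as higher confidence.
    """
    scores = [candidate_confidence(cond, ref) for cond, ref in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy example: two candidates, two action tokens each, vocabulary of three bins.
uniform = [1 / 3, 1 / 3, 1 / 3]
peaked = ([[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]], [uniform, uniform])
flat = ([[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]], [uniform, uniform])

best, scores = select_action([peaked, flat])
print(best, scores)
```

In this toy setup the candidate whose conditioned distributions are sharply peaked diverges more from the near-uniform reference than the flat candidate does, so it is the one selected. In a real system the distributions would come from the VLA's action-token logits rather than hand-written lists.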
