Verifier-free Test-Time Sampling for Vision Language Action Models

Abstract

Vision-Language-Action models (VLAs) have demonstrated remarkable performance in robot control. However, their single-inference paradigm leaves them fundamentally limited on tasks that require high precision. Test-time scaling approaches that use external verifiers have shown promise, but they require additional training and fail to generalize to unseen conditions. We propose Masking Distribution Guided Selection (MG-Select), a novel test-time scaling framework for VLAs that leverages the model's internal properties without additional training or external modules. Our approach uses the KL divergence from a reference action token distribution as a confidence metric for selecting the optimal action from multiple candidates. The reference distribution is generated by the same VLA but with randomly masked state and language conditions as inputs, which ensures maximum uncertainty while remaining aligned with the target task distribution. Additionally, we propose a joint training strategy that applies dropout to the state and language conditions so that the model learns both conditional and unconditional distributions, further improving the quality of the reference distribution. Our experiments show that MG-Select achieves significant performance improvements: 28% and 35% on real-world in-distribution and out-of-distribution tasks, respectively, and a 168% relative gain on RoboCasa pick-and-place tasks trained with 30 demonstrations.
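The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: the KL direction (conditioned distribution relative to the masked reference), summing the per-token KL over a candidate's action tokens, and picking the candidate with the largest score, on the reasoning that a larger divergence from a maximally uncertain reference indicates higher confidence. The function names are hypothetical, and the token distributions are taken as given rather than produced by a real VLA.

```python
import math
from typing import List, Sequence

# One categorical distribution over the action-token vocabulary.
Dist = Sequence[float]


def kl_divergence(p: Dist, q: Dist, eps: float = 1e-12) -> float:
    """KL(p || q) for two categorical distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against log(0).
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def candidate_confidence(cond: Sequence[Dist], ref: Sequence[Dist]) -> float:
    """Assumed aggregation: sum of per-token KL(cond || ref) for one candidate."""
    if len(cond) != len(ref):
        raise ValueError("candidate and reference must have the same token count")
    return sum(kl_divergence(p, q) for p, q in zip(cond, ref))


def select_action(
    cond_dists: Sequence[Sequence[Dist]],
    ref_dists: Sequence[Sequence[Dist]],
) -> int:
    """Return the index of the candidate with the highest assumed confidence."""
    scores: List[float] = [
        candidate_confidence(c, r) for c, r in zip(cond_dists, ref_dists)
    ]
    if not scores:
        raise ValueError("at least one candidate is required")
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    uniform = [1 / 3, 1 / 3, 1 / 3]
    # Two candidates, two action tokens each, vocabulary of size three.
    conditioned = [
        [[0.4, 0.3, 0.3], [0.4, 0.3, 0.3]],      # close to the reference
        [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05]],  # sharply peaked
    ]
    reference = [[uniform, uniform], [uniform, uniform]]
    print("selected candidate:", select_action(conditioned, reference))
```

In an actual system, the conditioned distributions would come from the VLA with full inputs and the reference distributions from the same VLA with masked state and language conditions; here both are supplied by hand for illustration.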
