

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

December 2, 2025
Authors: Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li
cs.AI

Abstract

Vision-Language-Action (VLA) models trained with flow-matching or diffusion objectives excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, because VLAs absorb diverse data modes during pre-training, and fine-tuning datasets often contain demonstrations collected in kinematically suboptimal or undesirable ways, the resulting policies retain redundant action modes that are irrelevant to the successful action modes of the downstream task. Specifically, we observe critical inference-time fragility across different sampling noises after supervised fine-tuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream-task dataset. We therefore propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. A VLA model integrated with TACO executes the action chunk with the maximum pseudo-count among all sampled chunks, preventing distribution shift while preserving the VLA's generalization ability, since the constraint is applied only at inference time. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational advantages over RL updates, especially for flow- or diffusion-based VLAs, which are difficult to update with RL because of the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, Robotwin, LIBERO, SimplerEnv) and a dual-arm robot platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptation.
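A minimal sketch of the test-time selection loop the abstract describes, under assumptions: the `vla_policy` and `pseudo_count` callables below are hypothetical placeholders (the abstract does not specify the authors' actual interfaces). The idea is to sample several action chunks from the frozen VLA, score each with a lightweight pseudo-count estimator fit on the downstream dataset, and execute the highest-scoring chunk, with no gradient updates to the VLA.

```python
# Hedged sketch of TACO-style test-time scaling: sample K action chunks,
# score each with a pseudo-count estimator, execute the argmax chunk.
# `vla_policy` and `pseudo_count` are assumed interfaces, not the authors' API.

import numpy as np


def select_action_chunk(vla_policy, pseudo_count, observation, num_samples=8):
    """Return the sampled action chunk with the largest pseudo-count.

    vla_policy(observation)  -> np.ndarray of shape (chunk_len, action_dim);
                                stochastic because each call denoises from
                                fresh sampling noise.
    pseudo_count(obs, chunk) -> float; larger values indicate the chunk lies
                                closer to the stable success modes of the
                                downstream fine-tuning data.
    """
    # Draw multiple candidate chunks from the (frozen) VLA policy.
    chunks = [vla_policy(observation) for _ in range(num_samples)]
    # Verify each candidate with the pseudo-count estimator.
    scores = np.array([pseudo_count(observation, c) for c in chunks])
    # Gradient-free, inference-time-only selection of the best candidate.
    return chunks[int(np.argmax(scores))]
```

Because the constraint is applied only at this selection step, the pre-trained VLA itself is left untouched, which is what allows the approach to curb distribution shift without sacrificing the model's generalization ability.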