

Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach

December 2, 2025
Authors: Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li
cs.AI

Abstract

Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, because VLAs absorb diverse data modes during pre-training, and fine-tuning datasets often contain demonstrations collected in kinematically suboptimal or otherwise undesirable ways, redundant action modes arise that are irrelevant to the successful action modes of the downstream task. Specifically, we observe critical inference-time fragility across different sampling noises after supervised fine-tuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream-task dataset. We therefore propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. A VLA model integrated with TACO executes the action chunk with the maximum pseudo-count among all sampled chunks, thereby preventing distribution shift while preserving the VLA's generalization ability, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL); being gradient-free, it offers substantial computational advantages over RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult because of the denoising process. Extensive experiments across four simulation benchmarks (RoboTwin 2.0, RoboTwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptation.
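
The abstract describes the mechanism only in prose, so below is a minimal, hedged sketch of the selection loop it outlines: sample several action chunks from the stochastic VLA head, score each with a pseudo-count verifier, and execute the highest-scoring chunk. The RND-style estimator, the names `RNDPseudoCount`, `taco_select`, and `vla_sample`, and all dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a TACO-style test-time-scaling loop.
# Assumption: the pseudo-count verifier is approximated with an RND-style
# predictor/target pair (the paper's actual estimator may differ); in practice
# the predictor would be trained to match the frozen target on the downstream
# demonstration data before deployment.
import torch
import torch.nn as nn


class RNDPseudoCount(nn.Module):
    """Hypothetical pseudo-count proxy: low predictor error on an
    (observation, action-chunk) pair ~ high pseudo-count, i.e., the pair
    looks like the fine-tuning data."""

    def __init__(self, input_dim: int, feat_dim: int = 128):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                    nn.Linear(256, feat_dim))
        self.predictor = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                       nn.Linear(256, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)  # frozen random target network

    def score(self, x: torch.Tensor) -> torch.Tensor:
        # Negative prediction error: higher = more in-distribution.
        return -((self.predictor(x) - self.target(x)) ** 2).mean(dim=-1)


@torch.no_grad()
def taco_select(vla_sample, verifier: RNDPseudoCount,
                obs_feat: torch.Tensor, num_samples: int = 16) -> torch.Tensor:
    """Sample several action chunks from the stochastic VLA head and return
    the one the verifier scores as most in-distribution (gradient-free)."""
    chunks = torch.stack([vla_sample(obs_feat) for _ in range(num_samples)])
    inputs = torch.cat([obs_feat.expand(num_samples, -1),
                        chunks.flatten(start_dim=1)], dim=-1)
    best = verifier.score(inputs).argmax()
    return chunks[best]


# Toy usage with a dummy Gaussian "policy" standing in for a flow-based VLA head.
obs = torch.randn(1, 64)                                  # placeholder observation features
dummy_sample = lambda o: o.new_empty(8, 7).normal_()      # 8-step chunk, 7-DoF actions
verifier = RNDPseudoCount(input_dim=64 + 8 * 7)
chunk = taco_select(dummy_sample, verifier, obs, num_samples=16)
```

Because selection is an argmax over sampled chunks, the procedure touches no VLA weights and needs no gradients, which matches the inference-only constraint described in the abstract.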