SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models
February 4, 2026
Authors: Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi
cs.AI
Abstract
Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention as a way to enhance robustness beyond the training distribution. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on "self-uncertainty", inspired by uncertainty-driven exploration in Active Inference theory; it requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
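The abstract describes conditioning both perception and action on the model's own uncertainty within a single forward pass. The sketch below is not the authors' implementation; it is a minimal illustration of the general idea, assuming the VLA exposes per-step logits over discretized action tokens. All names (`self_uncertainty`, `adaptive_temperatures`, `action_logits`, the temperature bounds) are hypothetical.

```python
# Hedged sketch: illustrate uncertainty-conditioned exploration vs. exploitation.
# Assumes access to the policy's action-token logits; names are hypothetical.
import numpy as np


def self_uncertainty(action_logits: np.ndarray) -> float:
    """Normalized entropy of the action-token distribution.

    Returns a value in [0, 1]: 0 = fully confident, 1 = maximally uncertain.
    """
    probs = np.exp(action_logits - action_logits.max())
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    return float(entropy / np.log(len(probs)))


def adaptive_temperatures(u: float, t_min: float = 0.1, t_max: float = 1.5) -> tuple[float, float]:
    """Map uncertainty u in [0, 1] to sampling temperatures for perception and action.

    High uncertainty -> hotter sampling (broader exploration of what to look at
    and what to do); low uncertainty -> near-greedy exploitation.
    """
    t_action = t_min + u * (t_max - t_min)
    t_perception = t_min + u * (t_max - t_min)  # a separate schedule could be used
    return t_perception, t_action


# Usage example with dummy logits: a nearly flat distribution yields high uncertainty,
# so both perception and action sampling are broadened.
logits = np.array([2.0, 1.9, 1.8, 0.5])
u = self_uncertainty(logits)
t_p, t_a = adaptive_temperatures(u)
print(f"uncertainty={u:.2f}, perception T={t_p:.2f}, action T={t_a:.2f}")
```

A single scalar derived from the model's own output distribution drives both knobs, which is consistent with the paper's claim of needing no verifier and no extra forward passes.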