
SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

February 4, 2026
Authors: Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi
cs.AI

Abstract

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention as a way to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory; it requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty and focuses on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.
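
The abstract does not specify how 'self-uncertainty' is computed or how it conditions perception and action, so the sketch below is only one illustrative reading, not the paper's implementation. It assumes self-uncertainty is the normalized entropy of the action-token distribution produced in the single forward pass, and that this scalar scales the sampling temperatures applied to visual attention (perception side) and action decoding (action side); the helpers `self_uncertainty`, `modulate_visual_attention`, and `scale_step` are hypothetical names introduced here for illustration.

```python
# Illustrative sketch of uncertainty-conditioned modulation (assumptions noted above;
# the paper's exact formulation may differ).
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / max(temperature, 1e-6)
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def self_uncertainty(action_logits):
    """Normalized predictive entropy in [0, 1] of the action-token distribution."""
    p = softmax(action_logits)
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return float(entropy / np.log(p.shape[-1]))

def modulate_visual_attention(attn_logits, temperature):
    """Perception-side modulation: re-weight attention over visual tokens
    with the shared temperature (illustrative only)."""
    return softmax(attn_logits, temperature)

def scale_step(action_logits, visual_attn_logits, t_min=0.3, t_max=1.5, rng=None):
    """One uncertainty-conditioned step (hypothetical helper).

    High uncertainty -> higher temperature (broader exploration in both
    perception and action); low uncertainty -> lower temperature (focused
    exploitation). A single forward pass supplies all logits used here.
    """
    rng = rng or np.random.default_rng()
    u = self_uncertainty(action_logits)
    temperature = t_min + u * (t_max - t_min)           # shared modulation signal
    attn = modulate_visual_attention(visual_attn_logits, temperature)
    probs = softmax(action_logits, temperature)          # action-side modulation
    action_token = int(rng.choice(len(probs), p=probs))
    return action_token, u, temperature, attn

# Example: confident (peaked) vs. ambiguous (flat) action logits over 4 action
# tokens, with attention logits over 6 visual tokens.
visual = np.array([0.2, 1.3, 0.4, 0.9, 0.1, 0.6])
for name, logits in [("confident", np.array([8.0, 0.5, 0.2, 0.1])),
                     ("ambiguous", np.array([1.1, 1.0, 0.9, 1.0]))]:
    token, u, temp, attn = scale_step(logits, visual)
    print(f"{name}: uncertainty={u:.2f} temperature={temp:.2f} action={token}")
```

The design intent mirrored in this sketch is that one scalar drives both sides of the model: high entropy widens both where the model attends and which action it samples, while low entropy collapses both toward near-greedy exploitation, without any verifier or extra forward passes.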