EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
December 16, 2025
Authors: Zechen Bai, Chen Gao, Mike Zheng Shou
cs.AI
Abstract
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, akin to how humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet they remain fundamentally limited by Supervised Finetuning (SFT): they require hundreds of demonstrations per task, rigidly memorize trajectories, and fail to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework that enables VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals, which are unavailable at test time, with autonomous feedback. We address this with a learned progress estimator that provides dense feedback, and, critically, we design the framework to "tame" this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism that smooths noisy point-wise estimates, and (2) a progressive horizon extension strategy that enables gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and cross-task generalization, reaching 20.8% success on unseen tasks without training on task-specific demonstrations (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent from the demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvement.
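To make the two taming mechanisms concrete, here is a minimal Python sketch of one plausible reading of the abstract: per-step progress scores are smoothed into a non-decreasing accumulated signal, and rollout horizons grow across training stages. The function names, the running-mean smoothing rule, and the schedule parameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def accumulated_progress(point_estimates: np.ndarray) -> np.ndarray:
    """Smooth noisy per-step progress scores in [0, 1] into a stable,
    non-decreasing trajectory-level signal. Here: a running mean, then
    a monotone envelope (an illustrative smoothing choice)."""
    running_mean = np.cumsum(point_estimates) / np.arange(1, len(point_estimates) + 1)
    return np.maximum.accumulate(running_mean)

def horizon_schedule(stage: int, base: int = 50, step: int = 50,
                     max_horizon: int = 300) -> int:
    """Progressive horizon extension: permit longer rollouts as training
    stages advance, so the policy evolves on gradually longer spans.
    All constants are hypothetical."""
    return min(base + stage * step, max_horizon)

# Usage: derive dense per-step feedback from noisy estimator outputs.
noisy = np.clip(0.01 * np.arange(100) + np.random.normal(0.0, 0.1, 100), 0.0, 1.0)
smoothed = accumulated_progress(noisy)
rewards = np.diff(smoothed, prepend=0.0)  # per-step reward = progress gained
for stage in range(4):
    print(f"stage {stage}: rollout horizon = {horizon_schedule(stage)} steps")
```

Under this reading, the monotone envelope keeps a single overestimated frame from injecting spurious negative rewards later in the episode, while the staged horizon keeps early test-time updates on short, low-variance spans before the policy is trusted with full-length tasks.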