EVOLVE-VLA: Test-Time Training from Environment Feedback for Vision-Language-Action Models
December 16, 2025
Authors: Zechen Bai, Chen Gao, Mike Zheng Shou
cs.AI
Abstract
Achieving truly adaptive embodied intelligence requires agents that learn not just by imitating static demonstrations, but by continuously improving through environmental interaction, much as humans master skills through practice. Vision-Language-Action (VLA) models have advanced robotic manipulation by leveraging large language models, yet they remain fundamentally limited by Supervised Finetuning (SFT): they require hundreds of demonstrations per task, rigidly memorize trajectories, and fail to adapt when deployment conditions deviate from training. We introduce EVOLVE-VLA, a test-time training framework that enables VLAs to continuously adapt through environment interaction with minimal or zero task-specific demonstrations. The key technical challenge is replacing oracle reward signals (unavailable at test time) with autonomous feedback. We address this with a learned progress estimator that provides dense feedback, and, critically, we design the framework to "tame" this inherently noisy signal via two mechanisms: (1) an accumulative progress estimation mechanism that smooths noisy point-wise estimates, and (2) a progressive horizon extension strategy that enables gradual policy evolution. EVOLVE-VLA achieves substantial gains: +8.6% on long-horizon tasks, +22.0% in 1-shot learning, and cross-task generalization, reaching 20.8% success on unseen tasks without training on task-specific demonstrations (vs. 0% for pure SFT). Qualitative analysis reveals emergent capabilities absent from the demonstrations, including error recovery and novel strategies. This work represents a critical step toward VLAs that truly learn and adapt, moving beyond static imitation toward continuous self-improvement.
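The abstract does not give implementation details for the two feedback-taming mechanisms. As a rough illustration only, the minimal sketch below assumes a learned progress estimator that emits noisy per-step scores in [0, 1]; it shows one plausible way to smooth point-wise estimates into an accumulative signal and to schedule a gradually extended training horizon. The function names (accumulative_progress, horizon_schedule) and all constants are hypothetical, not the authors' method.

```python
import numpy as np


def accumulative_progress(point_estimates):
    """Smooth noisy per-step progress scores (assumed to lie in [0, 1]).

    Takes the running mean of the point-wise estimates, then the cumulative
    maximum, yielding a monotone, less noisy progress/feedback signal.
    """
    estimates = np.asarray(point_estimates, dtype=float)
    running_mean = np.cumsum(estimates) / np.arange(1, len(estimates) + 1)
    return np.maximum.accumulate(running_mean)


def horizon_schedule(total_steps, start_horizon=2, max_horizon=10, growth_every=100):
    """Progressively extend the rollout horizon as test-time training proceeds."""
    horizons = start_horizon + np.arange(total_steps) // growth_every
    return np.clip(horizons, start_horizon, max_horizon)


if __name__ == "__main__":
    # Simulated noisy progress estimates for a 50-step rollout.
    rng = np.random.default_rng(0)
    noisy = np.clip(np.linspace(0.0, 1.0, 50) + rng.normal(0.0, 0.15, 50), 0.0, 1.0)
    smoothed = accumulative_progress(noisy)
    print("final smoothed progress:", round(float(smoothed[-1]), 3))
    print("horizon at training step 250:", int(horizon_schedule(300)[250]))
```

In this sketch the smoothed signal would stand in for the oracle reward during test-time policy updates, and the horizon schedule would control how far the policy is rolled out before feedback is applied; the actual EVOLVE-VLA design may differ.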