EvoVLA: Self-Evolving Vision-Language-Action Model
November 20, 2025
Authors: Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
cs.AI
Abstract
Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves 1.5 times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 percentage points and demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
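As a rough, hypothetical sketch of the Stage-Aligned Reward idea summarized above, the Python snippet below shows how a triplet contrastive objective over stage embeddings could penalize visual shortcuts: the anchor is the current observation, the positive is an exemplar of the truly completed stage, and the hard negative is a visually similar near-miss (e.g., a Gemini-generated counterexample). All names and dimensions here (StageEncoder, stage_aligned_reward, margin, obs_dim) are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch (assumptions, not EvoVLA's released code): a minimal PyTorch
# illustration of a triplet contrastive stage reward with hard negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageEncoder(nn.Module):
    """Maps observation features to an L2-normalized stage embedding."""

    def __init__(self, obs_dim: int = 512, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, obs_feat: torch.Tensor) -> torch.Tensor:
        # Normalize so dot products behave like cosine similarities.
        return F.normalize(self.net(obs_feat), dim=-1)


def stage_aligned_reward(
    encoder: StageEncoder,
    anchor_obs: torch.Tensor,         # current observation features
    positive_obs: torch.Tensor,       # exemplar of the truly completed stage
    hard_negative_obs: torch.Tensor,  # near-miss that only *looks* complete
    margin: float = 0.2,
):
    """Returns (triplet loss for training the encoder, dense shaping reward)."""
    a = encoder(anchor_obs)
    p = encoder(positive_obs)
    n = encoder(hard_negative_obs)

    # Pull the anchor toward the true stage exemplar and push it away from
    # the hard negative, so superficially similar states stop scoring high.
    loss = F.triplet_margin_loss(a, p, n, margin=margin)

    # Stage-aligned reward: similarity to the positive minus similarity to
    # the hard negative; shortcut states receive low or negative reward.
    reward = (a * p).sum(-1) - (a * n).sum(-1)
    return loss, reward


if __name__ == "__main__":
    enc = StageEncoder()
    obs, pos, neg = torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512)
    loss, reward = stage_aligned_reward(enc, obs, pos, neg)
    print(loss.item(), reward.shape)
```

In this sketch the same embedding space provides both the training signal (the triplet loss) and a dense shaping reward, so states that merely resemble stage completion are scored low rather than rewarded.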