
EvoVLA: Self-Evolving Vision-Language-Action Model

November 20, 2025
Authors: Zeting Liu, Zida Yang, Zeyu Zhang, Hao Tang
cs.AI

Abstract

Long-horizon robotic manipulation remains challenging for Vision-Language-Action (VLA) models despite recent progress in zero-shot generalization and simulation-to-real-world transfer. Current VLA models suffer from stage hallucination, where agents exploit coarse evaluation signals to shortcut multi-step tasks, reporting high progress without truly completing them. We present EvoVLA, a self-supervised VLA framework that addresses this issue through three complementary components: Stage-Aligned Reward (SAR), which uses triplet contrastive learning with Gemini-generated hard negatives to prevent visual shortcuts; Pose-Based Object Exploration (POE), which grounds curiosity in relative object-gripper pose instead of raw pixels; and Long-Horizon Memory, which uses selective context retention and gated fusion to stabilize intrinsic shaping during extended rollouts. Extensive evaluations on Discoverse-L, a long-horizon manipulation benchmark with three multi-stage tasks, show that EvoVLA improves average task success by 10.2 percentage points over the strongest baseline (OpenVLA-OFT), reaching 69.2 percent. EvoVLA also achieves one-and-a-half times better sample efficiency and reduces stage hallucination from 38.5 percent to 14.8 percent. Real-world deployment on physical robots reaches an average success rate of 54.6 percent across four manipulation tasks, outperforming OpenVLA-OFT by 11 points, demonstrating effective sim-to-real transfer and strong generalization. Code: https://github.com/AIGeeksGroup/EvoVLA. Website: https://aigeeksgroup.github.io/EvoVLA.
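To make the two reward-shaping components more concrete, below is a minimal PyTorch sketch of the mechanisms named in the abstract: a triplet contrastive objective for Stage-Aligned Reward (SAR) that pulls true stage completions toward the current observation embedding while pushing hard negatives (e.g., Gemini-generated look-alikes) away, and a Pose-Based Object Exploration (POE) curiosity bonus computed in object-gripper relative-pose space rather than pixel space. All names (`StageEncoder`, `sar_triplet_loss`, `pose_curiosity_bonus`), dimensions, and the nearest-neighbor novelty measure are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch, not the EvoVLA codebase: illustrates a triplet contrastive
# stage-alignment loss (SAR) and a pose-grounded curiosity bonus (POE).
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageEncoder(nn.Module):
    """Hypothetical head mapping observation features to a stage-alignment embedding."""
    def __init__(self, in_dim: int = 768, out_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so distances in embedding space are comparable across batches.
        return F.normalize(self.net(x), dim=-1)


def sar_triplet_loss(encoder: StageEncoder,
                     anchor_obs: torch.Tensor,        # features of current-stage frames
                     positive_obs: torch.Tensor,      # frames that truly complete the stage
                     hard_negative_obs: torch.Tensor, # visually similar but incomplete frames
                     margin: float = 0.5) -> torch.Tensor:
    """Triplet contrastive loss: true completions are pulled close, hard negatives pushed away."""
    a, p, n = encoder(anchor_obs), encoder(positive_obs), encoder(hard_negative_obs)
    return F.triplet_margin_loss(a, p, n, margin=margin)


def pose_curiosity_bonus(rel_pose: torch.Tensor,      # current object-gripper relative pose, shape (D,)
                         visited_poses: torch.Tensor, # previously visited relative poses, shape (N, D)
                         scale: float = 0.1) -> torch.Tensor:
    """Curiosity grounded in relative pose instead of raw pixels: reward distance
    to the nearest already-visited relative pose (a simple novelty proxy)."""
    if visited_poses.numel() == 0:
        return torch.tensor(scale)
    dists = torch.cdist(rel_pose.unsqueeze(0), visited_poses)  # (1, N)
    return scale * dists.min()
```

In a full training loop, the SAR embedding distance would be converted into a dense stage-progress reward and the pose-novelty bonus added as intrinsic reward, stabilized over long rollouts by the Long-Horizon Memory's selective context retention and gated fusion; the abstract does not specify those details, so this sketch only shows the signal computation.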