EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Abstract

Memory remains a critical bottleneck in long-horizon robotic manipulation: standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or otherwise unobservable over time. Existing memory-augmented methods do exploit historical context, but each has a drawback: some suffer from a severe information bottleneck, some incur high latency through decoupled dual systems, and others rely on unselective buffers that accumulate large amounts of redundant visual data. To address these limitations, we introduce EventVLA, an end-to-end framework built on the concept of sparse visual evidence memory. It comprises two core components: foundational visual anchors that retain the initial and short-term context, and a dynamic Keyframe Evidence Memory (KEM) module. KEM predicts future keyframe probabilities directly from the VLA's latent embeddings and uses them to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism lets the policy dynamically evaluate the future causal utility of the current observation and preserve transient visual evidence before it becomes unobservable. We also propose RoboTwin-MeM, a diagnostic benchmark designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
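
The selective storage described for the anchors and KEM can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `SparseEvidenceMemory`, the parameters `recent_window`, `max_keyframes`, and `threshold`, the fixed-threshold write rule, and the oldest-first eviction are all hypothetical choices made for illustration. The keyframe probabilities are hard-coded stand-ins for the scores that KEM would predict from the VLA's latent embeddings.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional, Tuple

Entry = Tuple[int, Any]  # (time step, frame); the frame type is left open


@dataclass
class SparseEvidenceMemory:
    """Toy memory: one initial anchor, a short recent window, sparse keyframes."""

    recent_window: int = 2    # hypothetical short-term context length
    max_keyframes: int = 4    # hypothetical cap on stored events
    threshold: float = 0.5    # hypothetical keyframe-probability cutoff
    initial: Optional[Entry] = None
    recent: Deque[Entry] = field(default_factory=deque)
    keyframes: List[Entry] = field(default_factory=list)

    def update(self, step: int, frame: Any, keyframe_prob: float) -> None:
        # Anchor: keep the very first observation for the whole episode.
        if self.initial is None:
            self.initial = (step, frame)
        # Anchor: keep a sliding window of the most recent observations.
        self.recent.append((step, frame))
        while len(self.recent) > self.recent_window:
            self.recent.popleft()
        # Event memory: store the frame only if its score is high enough.
        # Assumption: a fixed threshold, dropping the oldest event when full.
        if keyframe_prob >= self.threshold:
            self.keyframes.append((step, frame))
            if len(self.keyframes) > self.max_keyframes:
                self.keyframes.pop(0)

    def context(self) -> List[Entry]:
        """Return all retained frames in time order, one entry per step."""
        merged = {}
        if self.initial is not None:
            merged[self.initial[0]] = self.initial[1]
        for step, frame in self.keyframes:
            merged[step] = frame
        for step, frame in self.recent:
            merged[step] = frame
        return [(step, merged[step]) for step in sorted(merged)]


# Stand-in scores: only step 2 is treated as a task-critical event.
memory = SparseEvidenceMemory()
probs = [0.1, 0.2, 0.9, 0.1, 0.1, 0.1]
for t, p in enumerate(probs):
    memory.update(t, f"frame{t}", p)

steps = [step for step, _ in memory.context()]
print(steps)  # [0, 2, 4, 5]
```

In this toy run the frame from step 2 is still available at step 5, although it has long left the two-frame recent window. An unselective buffer would have to keep all six frames to guarantee the same.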