EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Abstract

Memory remains a critical bottleneck for long-horizon robotic manipulation: standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. Existing memory-augmented methods do use historical context, but they suffer from severe information bottlenecks, incur high latency through decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancy. To address these limitations, we introduce EventVLA, an end-to-end framework built on the concept of sparse visual evidence memory. It comprises two core components: foundational visual anchors that retain the initial and short-term context, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM predicts future keyframe probabilities directly from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism lets the policy dynamically evaluate the future causal utility of the current observation and preserve transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
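As a rough illustration of the memory layout described above, the following minimal sketch keeps an initial-frame anchor, a short window of recent frames, and a sparse store of frames whose keyframe probability is high. It is not the EventVLA implementation: the class name `SparseEvidenceMemory`, the threshold, the capacity, the window length, and the lowest-probability eviction rule are all hypothetical choices made for this example, and the keyframe probability is passed in as a plain number rather than predicted from the VLA's latent embeddings.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Deque, List, Optional, Tuple


@dataclass
class SparseEvidenceMemory:
    """Toy memory: fixed anchors plus a sparse store of likely keyframes."""

    threshold: float = 0.5   # assumed cutoff on the keyframe probability
    capacity: int = 4        # assumed maximum number of stored keyframes
    recent_window: int = 2   # assumed length of the short-term anchor
    initial: Optional[Any] = None
    recent: Deque[Any] = field(default_factory=deque)
    keyframes: List[Tuple[int, float, Any]] = field(default_factory=list)

    def update(self, step: int, frame: Any, keyframe_prob: float) -> None:
        """Record one observation; keyframe_prob stands in for a model output."""
        # Anchor 1: the very first frame is kept for the whole episode.
        if self.initial is None:
            self.initial = frame

        # Anchor 2: a sliding window over the most recent frames.
        self.recent.append(frame)
        if len(self.recent) > self.recent_window:
            self.recent.popleft()

        # Sparse evidence: store only frames judged likely to matter later.
        if keyframe_prob >= self.threshold:
            self.keyframes.append((step, keyframe_prob, frame))
            if len(self.keyframes) > self.capacity:
                # Hypothetical eviction rule: drop the least confident entry.
                weakest = min(
                    range(len(self.keyframes)),
                    key=lambda i: self.keyframes[i][1],
                )
                self.keyframes.pop(weakest)

    def context(self) -> List[Any]:
        """Frames handed to the policy: initial, keyframes, then recent."""
        stored = [frame for _, _, frame in self.keyframes]
        return [self.initial] + stored + list(self.recent)


# Usage: strings stand in for camera frames, floats for predicted probabilities.
mem = SparseEvidenceMemory(threshold=0.5, capacity=2, recent_window=2)
for t, p in enumerate([0.1, 0.9, 0.2, 0.6, 0.8, 0.1]):
    mem.update(t, f"frame{t}", p)
print([step for step, _, _ in mem.keyframes])
```

The point of the sketch is the contrast with an unselective buffer: the context grows only when an observation is scored as likely future evidence, so most frames are never stored. A frame can appear both as a keyframe and in the recent window; a fuller version would likely deduplicate.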