EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Abstract

Memory remains a critical bottleneck for long-horizon robotic manipulation: standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or otherwise unobservable over time. Existing memory-augmented methods use historical context, but they suffer from severe information bottlenecks, incur high latency through decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancy. To address these limitations, we introduce EventVLA, an end-to-end framework built on sparse visual evidence memory. It comprises two core components: foundational visual anchors that retain initial and short-term context, and a dynamic Keyframe Evidence Memory (KEM) module. KEM predicts future keyframe probabilities directly from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism lets the policy dynamically evaluate the future causal utility of current observations and preserve transient visual evidence before it becomes unobservable. We also propose RoboTwin-MeM, a diagnostic benchmark designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.
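
To make the memory layout concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that the keyframe probability comes from a linear head with a sigmoid over the latent embedding, that a frame is stored when its probability crosses a fixed threshold, and that the anchors are the first frame plus a short sliding window. The names `keyframe_probability`, `SparseEvidenceMemory`, `threshold`, `short_term_len`, and `max_keyframes` are hypothetical and do not come from the paper.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Sequence


def keyframe_probability(embedding: Sequence[float],
                         weights: Sequence[float],
                         bias: float = 0.0) -> float:
    """Hypothetical keyframe head: linear projection followed by a sigmoid.

    The abstract only says probabilities are predicted from latent
    embeddings; the linear-plus-sigmoid form is an assumption.
    """
    logit = sum(e * w for e, w in zip(embedding, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


@dataclass
class SparseEvidenceMemory:
    """Anchors (initial frame + short-term window) plus sparse keyframes."""
    threshold: float = 0.5      # assumed gating rule, not from the paper
    short_term_len: int = 2     # assumed size of the short-term window
    max_keyframes: int = 4      # assumed cap on stored keyframes
    initial_frame: Optional[int] = None
    short_term: Deque[int] = field(default_factory=deque)
    keyframes: List[int] = field(default_factory=list)

    def update(self, frame_id: int, prob: float) -> None:
        # The first observation is kept permanently as the initial anchor.
        if self.initial_frame is None:
            self.initial_frame = frame_id
        # Store the frame as evidence only if it is predicted to matter later.
        if prob >= self.threshold:
            self.keyframes.append(frame_id)
            if len(self.keyframes) > self.max_keyframes:
                self.keyframes.pop(0)  # drop the oldest keyframe
        # Maintain the sliding short-term window.
        self.short_term.append(frame_id)
        while len(self.short_term) > self.short_term_len:
            self.short_term.popleft()

    def context(self) -> List[int]:
        """Frame ids a policy would condition on, deduplicated in order."""
        if self.initial_frame is None:
            return []
        ordered = [self.initial_frame, *self.keyframes, *self.short_term]
        return list(dict.fromkeys(ordered))


memory = SparseEvidenceMemory(threshold=0.5, short_term_len=2)
for t, p in enumerate([0.1, 0.9, 0.2, 0.1, 0.7, 0.05]):
    memory.update(t, p)
print(memory.context())
```

In this toy run only frames 1 and 4 cross the threshold, so the context stays small regardless of episode length: the initial frame, the stored keyframes, and the most recent frames. A real system would store image features rather than frame indices and would learn the keyframe head jointly with the policy.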