MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior benchmarks, many visually grounded questions can be answered from captions or textual traces alone, so a correct answer does not show that fine-grained visual evidence was retained. Harder cases that require reasoning over changing visual states are also largely absent. We therefore introduce MemEye, a framework that evaluates memory capabilities along two dimensions. The first measures the granularity of the decisive visual evidence, from scene level to pixel level. The second measures how the retrieved evidence must be used, from single evidence to evolutionary synthesis. Under this framework, we construct a new benchmark spanning 8 life-scenario tasks, with ablation-driven validation gates that check answerability, shortcut resistance, visual necessity, and reasoning structure. Evaluating 13 memory methods across 4 VLM backbones, we show that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time. Our findings indicate that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.