MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered from captions or textual traces alone, so answers can be inferred without retaining fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. To address this, we introduce MemEye, a framework that evaluates memory along two dimensions: the first measures the granularity of the decisive visual evidence (from scene level to pixel level), and the second measures how retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark spanning eight life-scenario tasks, with ablation-driven validation gates that assess answerability, shortcut resistance, visual necessity, and reasoning structure. Evaluating 13 memory methods across four vision-language model (VLM) backbones, we show that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time. Our findings indicate that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.
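
To make the two-axis design and the validation gates concrete, the following is a minimal sketch and not the benchmark's actual implementation. The abstract names only the endpoints of each axis and the four gate names. The ablation conditions (`captions_only`, `images_removed`, `single_item_only`), the pass criteria, and all class and function names below are hypothetical assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Granularity(Enum):
    # Only the two endpoints are named in the abstract; any
    # intermediate levels are deliberately left out of this sketch.
    SCENE = "scene-level"
    PIXEL = "pixel-level"


class Usage(Enum):
    # Endpoints of the second axis, as named in the abstract.
    SINGLE_EVIDENCE = "single evidence"
    EVOLUTIONARY_SYNTHESIS = "evolutionary synthesis"


@dataclass(frozen=True)
class AblationResult:
    """Hypothetical record: was a question answered correctly
    under each (assumed) input condition?"""
    full_context: bool      # all images and text available
    captions_only: bool     # images replaced by captions or text traces
    images_removed: bool    # decisive images dropped entirely
    single_item_only: bool  # only one evidence item provided


def check_gates(usage: Usage, r: AblationResult) -> Dict[str, bool]:
    """Map ablation outcomes to the four gate names (assumed criteria)."""
    return {
        # The question must be solvable when everything is available.
        "answerability": r.full_context,
        # Captions or text traces alone must not suffice.
        "shortcut_resistance": not r.captions_only,
        # Removing the decisive images must break the answer.
        "visual_necessity": not r.images_removed,
        # A synthesis question must not be solvable from one item.
        "reasoning_structure": (
            usage is Usage.SINGLE_EVIDENCE or not r.single_item_only
        ),
    }


def keep_question(usage: Usage, r: AblationResult) -> bool:
    """A question is kept only if every gate passes."""
    return all(check_gates(usage, r).values())
```

Under these assumptions, a question that a model can still answer from captions alone would fail the shortcut-resistance gate and be filtered out, which is the failure mode the abstract attributes to prior benchmarks.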