MemEye: A Visual-Centric Evaluation Framework for Multimodal Agent Memory

Abstract

Long-term agent memory is increasingly multimodal, yet existing evaluations rarely test whether agents preserve the visual evidence needed for later reasoning. In prior work, many visually grounded questions can be answered from captions or textual traces alone, so answers can be inferred without retaining fine-grained visual evidence. Meanwhile, harder cases that require reasoning over changing visual states are largely absent. We therefore introduce MemEye, a framework that evaluates memory along two dimensions: the granularity of the decisive visual evidence (from scene-level to pixel-level), and how the retrieved evidence must be used (from single evidence to evolutionary synthesis). Under this framework, we construct a new benchmark spanning 8 life-scenario tasks, with ablation-driven validation gates that check answerability, shortcut resistance, visual necessity, and reasoning structure. Evaluating 13 memory methods across 4 VLM backbones, we find that current architectures still struggle to preserve fine-grained visual details and to reason about state changes over time. Our findings indicate that long-term multimodal memory depends on evidence routing, temporal tracking, and detail extraction.