M^3Eval: Multi-modal Memory Evaluation via Cognitively-Driven Video Tasks

Abstract

As multi-modal models advance toward long-form video understanding, memory emerges as a critical capability. Despite substantial effort in developing video datasets and benchmarks, existing work focuses primarily on perception and reasoning and does not systematically evaluate memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns that differ substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than in the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.