M³Eval: Multi-modal Memory Evaluation via Cognitively Grounded Video Tasks

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing work focuses primarily on perception and reasoning and does not systematically evaluate memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M³Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Using M³Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models (1) struggle to maintain disentangled representations when processing parallel video streams, (2) exhibit interference patterns that differ substantially from those observed in human memory, (3) ground memory sources more reliably in the spatial domain than in the temporal domain, and (4) demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.