M^3Eval: Multi-modal Memory Evaluation via Cognitively Grounded Video Tasks

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing work focuses primarily on perception and reasoning and does not systematically evaluate memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Using M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns that differ substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than in the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.