M³Eval: Evaluating Multi-Modal Memory via Cognitively Grounded Video Tasks

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing work focuses primarily on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M³Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M³Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than in the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.