MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Abstract

Memory is essential for large vision-language models (LVLMs) to handle long multimodal interactions, and two lines of work provide this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark systematically compares the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations. It comprises 789 questions across five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal) at four standard context lengths (32K to 256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: on the 80.4% of questions whose evidence includes images, removing the evidence images drops two frontier LVLMs below 2% accuracy. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high accuracy at short context lengths through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. Multi-session reasoning caps most systems below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.