MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

Abstract

Memory is essential for large vision-language models (LVLMs) to handle long, multimodal interactions, and two lines of work provide this capability: long-context LVLMs and memory-augmented agents. However, no existing benchmark systematically compares the two on questions that genuinely require multimodal evidence. To close this gap, we introduce MEMLENS, a comprehensive benchmark for memory in multimodal multi-session conversations. It comprises 789 questions covering five memory abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge update, and answer refusal), evaluated at four standard context lengths (32K to 256K tokens) under a cross-modal token-counting scheme. An image-ablation study confirms that solving MEMLENS requires visual evidence: removing evidence images drops two frontier LVLMs below 2% accuracy on the 80.4% of questions whose evidence includes images. Evaluating 27 LVLMs and 7 memory-augmented agents, we find that long-context LVLMs achieve high short-context accuracy through direct visual grounding but degrade as conversations grow, whereas memory agents are length-stable but lose visual fidelity under storage-time compression. On multi-session reasoning, most systems score below 30%, and neither approach alone solves the task. These results motivate hybrid architectures that combine long-context attention with structured multimodal retrieval. Our code is available at https://github.com/xrenaf/MEMLENS.