One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Abstract

External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms store each memory item as raw text or a raw image, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This leads to high token consumption and storage pressure, which makes these systems impractical for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly as a prompt to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor end to end with reconstruction, contrastive, and distillation objectives. We evaluate Latent Memory on seven text-only QA benchmarks (e.g., HotpotQA) and on multimodal QA benchmarks, where it achieves QA performance competitive with advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
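
The retrieval step described in the abstract (embed the query into the latent space, then select the most relevant latent tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors stand in for compressor outputs, and the dimension `DIM`, the use of cosine similarity, the value of `k`, and the function names `normalize` and `retrieve_latent_tokens` are all assumptions made for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def retrieve_latent_tokens(query_vec, memory, k=2):
    """Return the indices of the k memory vectors most similar to the query.

    memory holds one latent vector per evidence item (text or image).
    Cosine similarity is an assumed scoring rule, not taken from the paper.
    """
    q = normalize(query_vec)
    scores = []
    for idx, latent in enumerate(memory):
        m = normalize(latent)
        scores.append((sum(a * b for a, b in zip(q, m)), idx))
    scores.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scores[:k]]


# Hypothetical stand-ins: random vectors play the role of compressor outputs.
random.seed(0)
DIM = 16
memory = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

# A query constructed to lie close to item 3 (a small perturbation of it).
query = [x + 0.01 * random.gauss(0, 1) for x in memory[3]]

top = retrieve_latent_tokens(query, memory, k=2)

# The generator would receive one latent vector per retrieved item
# rather than the full raw passages or images.
prompt_latents = [memory[i] for i in top]
```

In this sketch the generator input grows by one vector per retrieved evidence item, which is the property the abstract credits for the reduction in generator tokens; how those vectors are actually injected into the pretrained LLM or VLM is not specified here.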