One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA
June 9, 2026
Authors: Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee
cs.AI
Abstract
External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms store each memory item as raw text or a raw image, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This leads to high token consumption and storage pressure, which makes these systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly into the prompt of a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor end to end with reconstruction, contrastive, and distillation objectives. We evaluate Latent Memory on seven text-only QA benchmarks (e.g., HotpotQA) and on multimodal QA benchmarks, where it achieves QA performance competitive with advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
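
The retrieve-then-prompt flow described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the authors' code: `toy_compress` is a hypothetical hashed bag-of-words stand-in for the trained compressor LLM/VLM, `DIM` is an arbitrary toy width, and word counts serve as a rough proxy for token counts. The sketch only shows the data flow (one vector stored per evidence item, query embedded into the same space, top-k vectors handed to the generator); it does not model the reconstruction, contrastive, or distillation training.

```python
import hashlib
import math
import re

DIM = 32  # toy latent width (assumption); a real latent token would be far wider


def words(text):
    """Crude word split, used here as a rough proxy for tokenization."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_compress(text, dim=DIM):
    """Hypothetical stand-in for the compressor: one unit-norm vector per input."""
    vec = [0.0] * dim
    for word in words(text):
        # Hash each word into a bucket; a trained model would do this step instead.
        bucket = hashlib.sha256(word.encode("utf-8")).digest()[0] % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class LatentMemory:
    """Stores one latent vector per evidence item; the raw text is not kept."""

    def __init__(self):
        self.tokens = []

    def add(self, evidence):
        self.tokens.append(toy_compress(evidence))

    def retrieve(self, query, k=2):
        """Return indices of the k stored vectors most similar to the query."""
        q = toy_compress(query)  # the query is embedded into the same space
        scores = [sum(a * b for a, b in zip(q, t)) for t in self.tokens]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return ranked[:k]


def latent_prompt_length(query, k):
    """Generator input size with latent memory: k latent tokens plus query words."""
    return k + len(words(query))


def raw_prompt_length(query, passages):
    """Generator input size when raw passages are pasted into the prompt."""
    return sum(len(words(p)) for p in passages) + len(words(query))


evidence = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "Mount Fuji is the tallest mountain in Japan.",
    "The Great Barrier Reef lies off the coast of Queensland, Australia.",
]

memory = LatentMemory()
for item in evidence:
    memory.add(item)

query = "Which mountain is the tallest in Japan?"
top = memory.retrieve(query, k=2)
print("retrieved indices:", top)
print("latent prompt length:", latent_prompt_length(query, k=2))
print("raw prompt length:", raw_prompt_length(query, [evidence[i] for i in top]))
```

In this toy setup the generator-side cost of each retrieved item is a single position regardless of how long the original passage was, which is the source of the token savings the abstract reports. The actual ratio depends on passage length, the number of retrieved items, and the real tokenizer.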