One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

June 9, 2026
Authors: Zhi Zheng, Ziqiao Meng, Hao Luan, Wei Liu, Wee Sun Lee
cs.AI

Abstract

External memory effectively grounds question answering (QA) based on large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms store each memory item as raw text or a raw image, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This leads to high token consumption and storage pressure, which makes such systems unaffordable for resource-constrained applications. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly as a prompt to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor end to end with reconstruction, contrastive, and distillation objectives. We evaluate Latent Memory on seven text-only QA benchmarks (e.g., HotpotQA) and on multimodal QA benchmarks, where it achieves QA performance competitive with advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
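
The retrieve-then-prompt flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `retrieve_latent_tokens`, `build_generator_input`), the use of cosine similarity as the retrieval score, the placement of latent tokens before the question, and the toy 3-dimensional vectors are all assumptions made for illustration. The only points taken from the abstract are that each evidence item is stored as one latent vector, that the query is embedded into the same space for retrieval, and that the retrieved vectors are passed to the generator in place of raw text or images.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_latent_tokens(
    query: Vector, memory: List[Vector], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k memory entries most similar to the query.

    Assumption: similarity is cosine; the paper may use a different score.
    """
    scored = [(i, cosine(query, token)) for i, token in enumerate(memory)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def build_generator_input(
    question_embeddings: List[Vector],
    memory: List[Vector],
    hits: List[Tuple[int, float]],
) -> List[Vector]:
    """Prepend one latent vector per retrieved evidence item to the question.

    Assumption: latent tokens come before the question embeddings; each
    retrieved item costs exactly one generator position.
    """
    retrieved = [memory[i] for i, _ in hits]
    return retrieved + list(question_embeddings)


# Toy memory: three evidence items, each already compressed to one vector.
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query = [1.0, 0.1, 0.0]

hits = retrieve_latent_tokens(query, memory, k=2)
# Hypothetical question of four embedded tokens.
question = [[0.0, 0.0, 1.0]] * 4
generator_input = build_generator_input(question, memory, hits)
print([i for i, _ in hits], len(generator_input))
```

In this toy setup the generator sees two positions for the retrieved evidence plus four for the question, whereas a raw-text pipeline would spend as many positions as the retrieved passages have tokens. That gap is the kind of saving the abstract refers to, although the reported 3x to 10x figure comes from the authors' experiments and not from this sketch.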