One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Abstract

External memory grounds question answering (QA) with large language models (LLMs) and vision-language models (VLMs) in relevant multimodal evidence. However, existing memory paradigms store each memory item as raw text or images, so retrieval-based systems must pass the retrieved text or images to the generator LLM/VLM. This leads to high token consumption and storage pressure that resource-constrained applications cannot afford. We propose Latent Memory, a latent-space memory paradigm that replaces each raw text or image evidence item with a single high-dimensional latent token produced by a small compressor LLM/VLM. Rather than retrieving raw evidence for generation, Latent Memory operates in a unified latent representation space: the query is embedded into this space to retrieve relevant latent tokens, and the retrieved latent tokens are fed directly as prompt inputs to a pretrained LLM or VLM for answer generation. To make each latent token simultaneously informative for reconstruction, retrieval, and generation, we train the compressor end to end with reconstruction, contrastive, and distillation objectives. We evaluate Latent Memory on seven text-only QA benchmarks (e.g., HotpotQA) and on multimodal QA benchmarks, where it achieves QA performance competitive with advanced RAG baselines while consuming 3x to 10x fewer generator tokens. It also delivers the strongest image-grounded QA performance on WebQA. Code is available at https://github.com/zz1358m/Latent-Memory-Master.
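
The retrieval step described above can be illustrated with a minimal sketch. It is not the released implementation: the abstract does not state which similarity function is used, so cosine similarity is an assumption here, and the names `cosine`, `retrieve_latents`, `memory`, and `query` are hypothetical. The three-dimensional vectors are toy stand-ins for the latent tokens that a compressor LLM/VLM would produce.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_latents(
    query_vec: Sequence[float],
    memory: List[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k memory vectors closest to the query."""
    scored = [(i, cosine(query_vec, m)) for i, m in enumerate(memory)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy memory: one hypothetical latent vector per evidence item.
memory = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical query embedding in the same latent space.
query = [1.0, 0.2, 0.0]

top = retrieve_latents(query, memory, k=2)
top_indices = [i for i, _ in top]

# Each retrieved item contributes exactly one vector, so the generator
# would receive k extra input positions instead of k raw passages or images.
generator_inputs = [memory[i] for i in top_indices]
print(top_indices, len(generator_inputs))
```

In this toy setup the generator receives two vectors for two evidence items, which is the source of the token savings the abstract reports: the cost per evidence item is one input position regardless of how long the original passage or how large the original image was.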