Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning
March 6, 2025
Authors: Giulio Corallo, Orion Weller, Fabio Petroni, Paolo Papotti
cs.AI
Abstract
Incorporating external knowledge in large language models (LLMs) enhances
their utility across diverse applications, but existing methods have
trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via
similarity search, but key information may fall outside the top-ranked results.
Long-context models can process multiple documents but are computationally
expensive and limited by context window size. Inspired by students condensing
study material for open-book exams, we propose task-aware key-value (KV) cache
compression, which compresses external knowledge in a zero- or few-shot setup.
This enables LLMs to reason efficiently over a compacted representation of all
relevant information. Experiments show our approach outperforms both RAG and
task-agnostic compression methods. On LongBench v2, it improves accuracy by up
to 7 absolute points over RAG with a 30x compression rate, while reducing
inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG
performs well when sparse evidence suffices, whereas task-aware compression is
superior for broad knowledge tasks.
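The abstract does not spell out how the compression itself is performed, only that the KV cache of the external documents is reduced (roughly 30x here) in a task-aware way. As an illustration only, the sketch below shows one simple way such an idea could look: score cached key/value entries of a single attention head against a query vector derived from the task description, and keep only the top-scoring fraction of positions. The function name, the attention-based scoring rule, and the shapes are assumptions for this sketch, not the authors' actual method.

```python
import torch

def compress_kv_cache(keys, values, task_query, keep_ratio=1 / 30):
    """Toy sketch of task-aware KV cache compression (illustrative, not the paper's algorithm).

    keys, values: [num_tokens, head_dim] cached key/value vectors for the
                  external documents (one attention head, for simplicity).
    task_query:   [head_dim] query vector derived from the task description.
    keep_ratio:   fraction of cached tokens to retain (~1/30 for a 30x rate).
    """
    # Score each cached token by its attention weight w.r.t. the task query.
    scores = keys @ task_query / keys.shape[-1] ** 0.5      # [num_tokens]
    weights = torch.softmax(scores, dim=0)

    # Keep only the highest-scoring positions, preserving their original order.
    k = max(1, int(keys.shape[0] * keep_ratio))
    top = torch.topk(weights, k).indices.sort().values

    return keys[top], values[top]

# Example: 3,000 cached tokens compressed ~30x down to 100.
keys = torch.randn(3000, 64)
values = torch.randn(3000, 64)
task_query = torch.randn(64)
small_k, small_v = compress_kv_cache(keys, values, task_query)
print(small_k.shape)  # torch.Size([100, 64])
```

The compressed cache would then be reused across queries at inference time, which is where the reported latency drop (0.43s to 0.16s) would come from, since the model attends over far fewer cached positions.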