Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

Abstract

Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents, but they are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compacted representation of all relevant information. Experiments show that our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. Analysis on a synthetic dataset shows that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for tasks that require broad knowledge.
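
The abstract does not say how cache entries are scored, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes a single attention head, scores each cached key by the softmax attention it receives from a handful of task-description query vectors, and keeps the highest-scoring entries up to a budget set by the compression rate. The names `task_aware_compress`, `task_queries`, and `compression_rate` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def task_aware_compress(keys, values, task_queries, compression_rate):
    """Keep roughly len(keys) / compression_rate cache entries.

    Assumption (not from the paper): an entry's importance is the total
    attention weight it receives from the task-description queries.
    Returns the kept keys, kept values, and their original indices.
    """
    n = len(keys)
    dim = len(keys[0])
    budget = max(1, math.ceil(n / compression_rate))

    scores = [0.0] * n
    for q in task_queries:
        # Scaled dot-product attention logits of this query over all keys.
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight

    # Take the top-scoring entries, then restore original token order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:budget]
    keep = sorted(top)
    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy cache: 60 entries, two of which align with the task queries.
keys = [[0.0, 0.0] for _ in range(60)]
keys[7] = [5.0, 0.0]
keys[42] = [0.0, 5.0]
values = [[float(i)] for i in range(60)]
task_queries = [[1.0, 0.0], [0.0, 1.0]]

kept_keys, kept_values, kept_idx = task_aware_compress(
    keys, values, task_queries, compression_rate=30
)
print(kept_idx)  # [7, 42]
```

With 60 cached entries and a 30x rate, the budget is two entries, and the two keys aligned with the task queries are the ones retained. A task-agnostic method would instead choose entries without reference to `task_queries`.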
