Beyond RAG: Task-Aware KV Cache Compression for Comprehensive Knowledge Reasoning

Abstract

Incorporating external knowledge into large language models (LLMs) enhances their utility across diverse applications, but existing methods involve trade-offs. Retrieval-Augmented Generation (RAG) fetches evidence via similarity search, but key information may fall outside the top-ranked results. Long-context models can process multiple documents, but they are computationally expensive and limited by context window size. Inspired by students condensing study material for open-book exams, we propose task-aware key-value (KV) cache compression, which compresses external knowledge in a zero- or few-shot setup. This enables LLMs to reason efficiently over a compact representation of all relevant information. Experiments show that our approach outperforms both RAG and task-agnostic compression methods. On LongBench v2, it improves accuracy by up to 7 absolute points over RAG at a 30x compression rate, while reducing inference latency from 0.43s to 0.16s. A synthetic dataset highlights that RAG performs well when sparse evidence suffices, whereas task-aware compression is superior for broad knowledge tasks.
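
As a rough illustration of what "task-aware" selection could look like, the sketch below scores each cached key by the attention it receives from queries derived from a task description, then keeps only the highest-scoring KV pairs. This is an assumption made for illustration and not the paper's actual algorithm: the names `compress_kv`, `task_queries`, and `budget` are hypothetical, and a real system would operate per layer and per attention head on model tensors rather than on plain Python lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compress_kv(keys, values, task_queries, budget):
    """Keep the `budget` KV pairs that receive the most attention
    from the task queries (hypothetical scoring rule, for illustration)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    importance = [0.0] * len(keys)
    for query in task_queries:
        # Scaled dot-product attention weights of this query over all keys.
        weights = softmax([dot(query, key) * scale for key in keys])
        for i, weight in enumerate(weights):
            importance[i] += weight
    # Pick the top-`budget` positions, then restore original token order.
    top = sorted(range(len(keys)), key=lambda i: importance[i], reverse=True)[:budget]
    kept = sorted(top)
    return kept, [keys[i] for i in kept], [values[i] for i in kept]


# Toy cache with four positions and a single task query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
values = ["v0", "v1", "v2", "v3"]
task_queries = [[2.0, 1.0]]

kept, kept_keys, kept_values = compress_kv(keys, values, task_queries, budget=2)
print(kept)         # [0, 2]
print(kept_values)  # ['v0', 'v2']
```

Unlike top-k retrieval, which discards whole documents before the model sees them, a scheme of this kind scores every cached position against the task, so evidence spread across many documents can survive compression. The budget here corresponds to the compression rate: keeping 2 of 4 positions is a 2x rate, whereas the abstract reports results at 30x.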
