TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

Abstract

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to improve factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, which creates system challenges in latency-sensitive deployments, especially when GPU memory is limited. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm, and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
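
To make the lookahead idea concrete, the following is a minimal sketch of one plausible reading of the mechanism, not the TeleRAG implementation. It assumes that the incoming query is used to guess which IVF clusters the later, LLM-refined query will probe, that those clusters are copied to a GPU-side cache while the LLM is still generating, and that any clusters the guess missed are searched from CPU memory. All names (`top_clusters`, `prefetch`, `search`, `lookahead_retrieve`, `fake_rewrite`) and the toy data are hypothetical; the "GPU" is simulated with a plain dictionary and the transfer with a background thread.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_clusters(query, centroids, nprobe):
    """IVF coarse step: ids of the nprobe centroids most similar to the query."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(query, centroids[c]), reverse=True)
    return ranked[:nprobe]


def prefetch(cpu_clusters, cluster_ids):
    """Stand-in for an asynchronous CPU-to-GPU copy of the selected clusters."""
    return {c: list(cpu_clusters[c]) for c in cluster_ids}


def search(query, cluster_ids, gpu_cache, cpu_clusters, k):
    """Score prefetched clusters from the cache and the rest from CPU memory."""
    hits = [c for c in cluster_ids if c in gpu_cache]
    misses = [c for c in cluster_ids if c not in gpu_cache]
    candidates = []
    for c in hits:
        candidates.extend(gpu_cache[c])
    for c in misses:
        candidates.extend(cpu_clusters[c])
    candidates.sort(key=lambda item: dot(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in candidates[:k]], hits, misses


def lookahead_retrieve(input_query, rewrite_fn, centroids, cpu_clusters,
                       nprobe=2, k=2):
    """Overlap the (simulated) transfer with the (simulated) LLM generation."""
    # Assumption: the input query is similar enough to the refined query
    # that its top clusters are a useful prediction.
    predicted = top_clusters(input_query, centroids, nprobe)
    with ThreadPoolExecutor(max_workers=1) as pool:
        transfer = pool.submit(prefetch, cpu_clusters, predicted)
        refined_query = rewrite_fn(input_query)  # stands in for LLM generation
        gpu_cache = transfer.result()
    needed = top_clusters(refined_query, centroids, nprobe)
    return search(refined_query, needed, gpu_cache, cpu_clusters, k)


# Toy index: four clusters of 2-D vectors, stored as (doc_id, vector) pairs.
CENTROIDS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
CPU_CLUSTERS = {
    0: [("d0", [0.9, 0.1]), ("d1", [0.8, -0.2])],
    1: [("d2", [0.1, 0.9]), ("d3", [0.3, 0.8])],
    2: [("d4", [-0.9, 0.1])],
    3: [("d5", [0.0, -1.0])],
}


def fake_rewrite(query):
    """Hypothetical query rewriter; returns a fixed refined embedding."""
    return [-0.3, 1.0]


ids, hits, misses = lookahead_retrieve([1.0, 0.2], fake_rewrite,
                                       CENTROIDS, CPU_CLUSTERS)
print(ids, hits, misses)
```

In this toy run the prediction covers one of the two clusters the refined query needs, so one cluster is served from the simulated GPU cache and the other falls back to CPU. The returned documents are the same as a search with no prefetching at all, which reflects the intent that prefetching changes where the data is read from, not which results are returned.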
