

TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval

February 28, 2025
Authors: Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci
cs.AI

Abstract

Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm, and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
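
Below is a minimal, self-contained Python sketch of the lookahead-retrieval idea as the abstract describes it: IVF clusters predicted from an early (draft) query are prefetched in a background thread while generation proceeds, so the eventual retrieval mostly hits data that has already been moved. All names here (predict_clusters, prefetch, generate) and the dict standing in for GPU memory are illustrative assumptions, not TeleRAG's actual interface.

```python
# Hypothetical sketch of lookahead retrieval, not TeleRAG's implementation.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
DIM, N_CLUSTERS, VECS_PER_CLUSTER = 64, 256, 512

# Stand-in IVF datastore: centroids plus per-cluster vectors in "CPU memory".
centroids = rng.standard_normal((N_CLUSTERS, DIM)).astype(np.float32)
cpu_clusters = {c: rng.standard_normal((VECS_PER_CLUSTER, DIM)).astype(np.float32)
                for c in range(N_CLUSTERS)}
gpu_cache = {}  # clusters already resident on the "GPU" (here: just a dict)

def predict_clusters(query_vec, nprobe=8):
    """Rank IVF clusters by centroid similarity; the top nprobe are the
    clusters a retrieval for this query would scan (and thus prefetch)."""
    scores = centroids @ query_vec
    return np.argsort(scores)[-nprobe:][::-1]

def prefetch(cluster_ids):
    """Copy predicted clusters CPU -> GPU. In a real system this would be an
    asynchronous device transfer overlapped with LLM decoding."""
    for c in cluster_ids:
        if c not in gpu_cache:
            gpu_cache[c] = cpu_clusters[c].copy()

def generate(draft_query):
    """Placeholder for LLM decoding that eventually emits the refined query."""
    return draft_query + 0.05 * rng.standard_normal(DIM).astype(np.float32)

draft = rng.standard_normal(DIM).astype(np.float32)
with ThreadPoolExecutor(max_workers=1) as pool:
    # Overlap: prefetch clusters predicted from the draft query while the
    # "LLM" generates; both run concurrently until the with-block exits.
    pool.submit(prefetch, predict_clusters(draft))
    final_query = generate(draft)

# At retrieval time, most needed clusters are already cached; fetch misses only.
needed = predict_clusters(final_query)
hits = [c for c in needed if c in gpu_cache]
prefetch([c for c in needed if c not in gpu_cache])  # synchronous fallback
print(f"prefetch hit rate: {len(hits)}/{len(needed)}")
```

The sketch leans on the same property the abstract highlights: because the final query is similar to the draft query, the clusters predicted ahead of time largely coincide with those actually needed, so the CPU-to-GPU transfer cost is hidden behind generation rather than added to retrieval latency.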
