
RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference

May 5, 2025
作者: Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
cs.AI

Abstract

The growing context lengths of large language models (LLMs) pose significant challenges for efficient inference, primarily due to GPU memory and bandwidth constraints. We present RetroInfer, a novel system that reconceptualizes the key-value (KV) cache as a vector storage system which exploits the inherent attention sparsity to accelerate long-context LLM inference. At its core is the wave index, an Attention-aWare VEctor index that enables efficient and accurate retrieval of critical tokens through techniques such as tripartite attention approximation, accuracy-bounded attention estimation, and segmented clustering. Complementing this is the wave buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput. Unlike prior sparsity-based methods that struggle with token selection and hardware coordination, RetroInfer delivers robust performance without compromising model accuracy. Experiments on long-context benchmarks show up to 4.5X speedup over full attention within GPU memory limits and up to 10.5X over sparse attention baselines when KV cache is extended to CPU memory, all while preserving full-attention-level accuracy.
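The abstract's core idea — grouping keys into clusters, estimating each cluster's attention contribution, and retrieving only the critical tokens for exact attention — can be illustrated with a minimal sketch. This is not the paper's wave index implementation; the function name, the use of contiguous segments in place of learned clustering, and all parameters are illustrative assumptions.

```python
import numpy as np

def clustered_sparse_attention(q, K, V, n_clusters=4, n_select=2):
    """Sketch of cluster-based approximate attention: keys are grouped
    into segments, each summarized by its centroid; the query scores the
    centroids to estimate which clusters matter, and exact attention is
    computed only over keys in the selected clusters."""
    d = K.shape[1]
    # Segmented clustering: contiguous segments stand in for a real
    # clustering step (the paper uses segmented clustering on keys).
    segs = np.array_split(np.arange(K.shape[0]), n_clusters)
    centroids = np.stack([K[s].mean(axis=0) for s in segs])
    # Estimate per-cluster importance from centroid-query scores.
    est = centroids @ q
    chosen = np.argsort(est)[-n_select:]
    idx = np.concatenate([segs[c] for c in chosen])
    # Exact softmax attention over the retrieved (critical) keys only.
    scores = K[idx] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[idx]
```

In a real system the per-cluster estimates would carry accuracy bounds (so retrieval can stop once the approximation error is provably small), and the unselected clusters' KV entries could live in CPU memory, which is what makes the vector-storage framing pay off at long context lengths.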

