RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
May 5, 2025
Authors: Yaoqi Chen, Jinkai Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, Jingjia Luo, Di Liu, Huiqiang Jiang, Qi Chen, Jing Liu, Bailu Ding, Xiao Yan, Jiawei Jiang, Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
cs.AI
Abstract
The growing context lengths of large language models (LLMs) pose significant
challenges for efficient inference, primarily due to GPU memory and bandwidth
constraints. We present RetroInfer, a novel system that reconceptualizes the
key-value (KV) cache as a vector storage system which exploits the inherent
attention sparsity to accelerate long-context LLM inference. At its core is the
wave index, an Attention-aWare VEctor index that enables efficient and accurate
retrieval of critical tokens through techniques such as tripartite attention
approximation, accuracy-bounded attention estimation, and segmented clustering.
Complementing this is the wave buffer, which coordinates KV cache placement and
overlaps computation and data transfer across GPU and CPU to sustain high
throughput. Unlike prior sparsity-based methods that struggle with token
selection and hardware coordination, RetroInfer delivers robust performance
without compromising model accuracy. Experiments on long-context benchmarks
show up to 4.5X speedup over full attention within GPU memory limits and up to
10.5X over sparse attention baselines when KV cache is extended to CPU memory,
all while preserving full-attention-level accuracy.
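
To make the retrieval idea concrete, below is a minimal sketch of cluster-based critical-token selection in the spirit of the wave index's segmented clustering: key vectors are grouped into clusters, the query is scored against cluster centroids, and exact attention is computed only over tokens in the highest-scoring clusters. It is illustrative only; the function names (`build_clusters`, `sparse_attention`), the plain k-means clustering, and the fixed `top_clusters` budget are assumptions for exposition, not RetroInfer's actual algorithm or API.

```python
import numpy as np

def build_clusters(keys: np.ndarray, n_clusters: int, iters: int = 10):
    """Group cached key vectors into clusters; returns (centroids, assignments).

    A plain k-means stand-in for the paper's segmented clustering.
    """
    n, d = keys.shape
    rng = np.random.default_rng(0)
    centroids = keys[rng.choice(n, n_clusters, replace=False)]
    for _ in range(iters):
        # Assign each key to its nearest centroid (squared L2 distance).
        dists = ((keys[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = keys[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def sparse_attention(query, keys, values, centroids, assign, top_clusters=4):
    """Approximate attention for one query: rank clusters by query-centroid
    similarity, keep only keys in the top clusters, then run exact softmax
    attention on that retrieved subset of "critical" tokens."""
    d = query.shape[-1]
    cluster_scores = centroids @ query / np.sqrt(d)
    selected = np.argsort(cluster_scores)[::-1][:top_clusters]
    mask = np.isin(assign, selected)
    k_sel, v_sel = keys[mask], values[mask]
    logits = k_sel @ query / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ v_sel
```

Because only a small fraction of the cached keys and values ever reach the exact attention step, the bulk of the KV cache can stay in slower (CPU) memory, which is what makes the vector-storage framing attractive.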
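The wave buffer's overlap of computation and CPU-GPU transfer can likewise be pictured as a double-buffered pipeline: while the GPU computes attention over one chunk of retrieved KV entries, the next chunk is copied up on a separate CUDA stream. The sketch below uses PyTorch CUDA streams; the chunking scheme and the `process_chunk` callback are hypothetical stand-ins for RetroInfer's actual placement and scheduling logic.

```python
import torch

def overlapped_fetch_and_compute(cpu_kv_chunks, process_chunk):
    """Double-buffered pipeline: while the default stream computes on chunk i,
    chunk i+1 is copied host-to-device on a separate copy stream."""
    copy_stream = torch.cuda.Stream()
    outputs = []

    # Prefetch the first chunk on the copy stream.
    with torch.cuda.stream(copy_stream):
        next_gpu = cpu_kv_chunks[0].to("cuda", non_blocking=True)

    for i in range(len(cpu_kv_chunks)):
        # Make sure chunk i has finished copying before computing on it.
        torch.cuda.current_stream().wait_stream(copy_stream)
        cur_gpu = next_gpu
        if i + 1 < len(cpu_kv_chunks):
            # Kick off the next transfer while the current chunk is processed.
            with torch.cuda.stream(copy_stream):
                next_gpu = cpu_kv_chunks[i + 1].to("cuda", non_blocking=True)
        outputs.append(process_chunk(cur_gpu))  # runs on the default stream
    return outputs
```

For the copies to proceed asynchronously, the CPU-side chunks should reside in pinned memory (e.g., tensors created or converted with `pin_memory()`); otherwise `non_blocking=True` degrades to a synchronous copy.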