TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
April 18, 2024
Authors: Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
cs.AI
Abstract
With large language models (LLMs) widely deployed in long content generation
recently, there has emerged an increasing demand for efficient long-sequence
inference support. However, key-value (KV) cache, which is stored to avoid
re-computation, has emerged as a critical bottleneck by growing linearly in
size with the sequence length. Due to the auto-regressive nature of LLMs, the
entire KV cache will be loaded for every generated token, resulting in low
utilization of computational cores and high latency. While various compression
methods for KV cache have been proposed to alleviate this issue, they suffer
from degradation in generation quality. We introduce TriForce, a hierarchical
speculative decoding system that is scalable to long sequence generation. This
approach leverages the original model weights and dynamic sparse KV cache via
retrieval as a draft model, which serves as an intermediate layer in the
hierarchy and is further speculated by a smaller model to reduce its drafting
latency. TriForce not only facilitates impressive speedups for Llama2-7B-128K,
achieving up to a 2.31× speedup on an A100 GPU, but also showcases scalability in
handling even longer contexts. For the offloading setting on two RTX 4090 GPUs,
TriForce achieves 0.108 s/token, only half as slow as the
auto-regressive baseline on an A100 and a 7.78× speedup over our
optimized offloading system. Additionally, TriForce runs 4.86× faster than
DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is
highlighted by its consistently outstanding performance across various
temperatures. The code is available at
https://github.com/Infini-AI-Lab/TriForce.
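
To make the hierarchy concrete, below is a minimal, self-contained Python sketch of two-level speculative decoding in the spirit of the abstract. It is not the authors' implementation: the three "models" are toy stand-in distributions over integer token ids (in TriForce the middle level would be the original model weights attending to a retrieval-selected sparse KV cache, and the bottom level a much smaller draft model), and the residual sampling in the standard acceptance rule is simplified to a plain target sample.

```python
# Toy two-level speculative-decoding sketch (stand-ins, not TriForce code).
import random

random.seed(0)
VOCAB_SIZE = 50

def make_toy_model(tag):
    """Return a toy next-token distribution; a stand-in for a real LLM.
    `tag` only differentiates the three levels of the hierarchy."""
    def probs(prefix):
        # Deterministic pseudo-distribution keyed on the recent context.
        peak = hash((tag, tuple(prefix[-4:]))) % VOCAB_SIZE
        p = [0.5 / (VOCAB_SIZE - 1)] * VOCAB_SIZE
        p[peak] = 0.5
        return p
    return probs

def sample(p):
    return random.choices(range(VOCAB_SIZE), weights=p, k=1)[0]

def draft(model, prefix, k):
    """Autoregressively draft k candidate tokens from `model`."""
    ctx, out = list(prefix), []
    for _ in range(k):
        t = sample(model(ctx))
        out.append(t)
        ctx.append(t)
    return out

def verify(draft_model, target_model, prefix, candidates):
    """Standard speculative-decoding acceptance rule: accept token t with
    probability min(1, p_target(t) / p_draft(t)); on the first rejection,
    emit a replacement token instead (residual sampling simplified here)."""
    ctx, accepted = list(prefix), []
    for t in candidates:
        q, p = draft_model(ctx)[t], target_model(ctx)[t]
        if random.random() < min(1.0, p / q):
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(sample(target_model(ctx)))
            return accepted
    accepted.append(sample(target_model(ctx)))  # "bonus" token on full accept
    return accepted

def hierarchical_step(tiny, middle, full, prefix, k_small=4, k_mid=8):
    """One TriForce-style step. Level 1: the tiny model drafts for the
    middle model (in TriForce, the target weights over a retrieved sparse
    KV cache). Level 2: the full target, with its complete KV cache,
    verifies the middle model's accepted tokens in a single pass."""
    mid_tokens = []
    while len(mid_tokens) < k_mid:
        cand = draft(tiny, prefix + mid_tokens, k_small)
        mid_tokens += verify(tiny, middle, prefix + mid_tokens, cand)
    return verify(middle, full, prefix, mid_tokens[:k_mid])

tiny, middle, full = make_toy_model(0), make_toy_model(1), make_toy_model(2)
seq = [1, 2, 3]
for _ in range(5):
    seq += hierarchical_step(tiny, middle, full, seq)
print(seq)
```

Because every emitted token ultimately passes the full model's acceptance rule, the output distribution matches plain autoregressive sampling from the target, which is why speculative decoding of this kind is lossless; the hierarchy exists only to keep the middle model's drafting latency low.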