TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding

April 18, 2024
Authors: Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen
cs.AI

Abstract

With large language models (LLMs) now widely deployed for long-content generation, demand for efficient long-sequence inference support has grown. However, the key-value (KV) cache, which is stored to avoid re-computation, has emerged as a critical bottleneck because it grows linearly with sequence length. Due to the auto-regressive nature of LLMs, the entire KV cache must be loaded for every generated token, resulting in low utilization of computational cores and high latency. While various KV cache compression methods have been proposed to alleviate this issue, they suffer from degraded generation quality. We introduce TriForce, a hierarchical speculative decoding system that is scalable to long sequence generation. This approach leverages the original model weights and a dynamic sparse KV cache obtained via retrieval as a draft model, which serves as an intermediate layer in the hierarchy and is in turn speculated by a smaller model to reduce its drafting latency. TriForce not only delivers impressive speedups for Llama2-7B-128K, up to 2.31× on an A100 GPU, but also scales to even longer contexts. In the offloading setting on two RTX 4090 GPUs, TriForce achieves 0.108 s/token, only half the per-token latency of the auto-regressive baseline on an A100, and attains a 7.78× speedup on our optimized offloading system. Additionally, TriForce runs 4.86× faster than DeepSpeed-Zero-Inference on a single RTX 4090 GPU. TriForce's robustness is highlighted by its consistently strong performance across various sampling temperatures. The code is available at https://github.com/Infini-AI-Lab/TriForce.
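The hierarchy the abstract describes can be made concrete with a small sketch. Below is a minimal, self-contained Python illustration of two-level speculative decoding under toy assumptions; it is not the TriForce implementation. The names `toy_model`, `hierarchical_generate`, and the draft lengths `k1`/`k2` are hypothetical, tiny categorical "models" stand in for real LLMs, and the retrieval-based sparse KV cache is abstracted into the middle draft model.

```python
import random

VOCAB = 8  # toy vocabulary size


def toy_model(seed):
    """Return a callable mapping a token prefix to a categorical
    distribution over VOCAB tokens -- a stand-in for a real LLM."""
    def dist(prefix):
        rng = random.Random(hash((seed, tuple(prefix))))
        w = [rng.random() + 0.1 for _ in range(VOCAB)]
        s = sum(w)
        return [x / s for x in w]
    return dist


def sample(p, rng):
    """Sample a token index from the distribution p."""
    r, acc = rng.random(), 0.0
    for tok, q in enumerate(p):
        acc += q
        if r < acc:
            return tok
    return len(p) - 1


def verify(draft, target, prefix, drafted, rng):
    """Standard speculative-sampling verification: accept each drafted
    token with prob min(1, p/q); on rejection, resample from the
    residual max(0, p - q) and stop."""
    out = list(prefix)
    for tok in drafted:
        p, q = target(out), draft(out)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            res = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            s = sum(res) or 1.0
            out.append(sample([x / s for x in res], rng))
            return out
    out.append(sample(target(out), rng))  # bonus token if all accepted
    return out


def speculative_step(draft, target, prefix, k, rng):
    """Draft k tokens with `draft`, then verify them with `target`."""
    ctx = list(prefix)
    for _ in range(k):
        ctx.append(sample(draft(ctx), rng))
    return verify(draft, target, prefix, ctx[len(prefix):], rng)


def hierarchical_generate(tiny, middle, target, prompt, steps, k1=3, k2=6):
    """Two-level speculation: `tiny` accelerates drafting by `middle`
    (the retrieval-cache draft model in TriForce's hierarchy), and
    `target` verifies the tokens `middle` produced."""
    rng, seq = random.Random(0), list(prompt)
    for _ in range(steps):
        # Level 1: grow a draft of >= k2 tokens; the tiny model
        # speculates for the middle model, so the draft is still an
        # exact sample from the middle model's distribution.
        draft_seq = list(seq)
        while len(draft_seq) - len(seq) < k2:
            draft_seq = speculative_step(tiny, middle, draft_seq, k1, rng)
        # Level 2: the full target verifies the middle model's draft,
        # so the final output matches the target distribution losslessly.
        seq = verify(middle, target, seq, draft_seq[len(seq):], rng)
    return seq


if __name__ == "__main__":
    tiny, middle, target = toy_model(1), toy_model(2), toy_model(3)
    print(hierarchical_generate(tiny, middle, target, [0], steps=5))
```

The design point the sketch preserves is that speculative verification is distribution-exact at every level: the tiny model cheapens drafting for the middle model without changing what the middle model would have sampled, and the target's verification keeps the final output lossless.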
