Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
October 26, 2023
Authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
cs.AI
Abstract
Large language models (LLMs) with hundreds of billions of parameters have
sparked a new wave of exciting AI applications. However, they are
computationally expensive at inference time. Sparsity is a natural approach to
reduce this cost, but existing methods either require costly retraining, have
to forgo LLM's in-context learning ability, or do not yield wall-clock time
speedup on modern hardware. We hypothesize that contextual sparsity, which
consists of small, input-dependent sets of attention heads and MLP parameters
that yield approximately the same output as the dense model for a given input, can address
these issues. We show that contextual sparsity exists, that it can be
accurately predicted, and that we can exploit it to speed up LLM inference in
wall-clock time without compromising LLM's quality or in-context learning
ability. Based on these insights, we propose DejaVu, a system that uses a
low-cost algorithm to predict contextual sparsity on the fly given inputs to
each layer, along with an asynchronous and hardware-aware implementation that
speeds up LLM inference. We validate that DejaVu can reduce the inference
latency of OPT-175B by over 2X compared to the state-of-the-art
FasterTransformer, and over 6X compared to the widely used Hugging Face
implementation, without compromising model quality. The code is available at
https://github.com/FMInference/DejaVu.
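
To make the idea concrete, the sketch below illustrates how a low-cost predictor could select input-dependent MLP neurons so that only a small slice of the dense weights is computed. This is a minimal illustration under assumed shapes and names (the `SparseMLPBlock` module, the low-rank `predictor`, and the per-example gather loop are all hypothetical simplifications), not the DejaVu implementation, which uses an asynchronous, hardware-aware kernel.

```python
# Minimal sketch of contextual sparsity for an MLP block (illustrative only):
# a cheap low-rank predictor scores the hidden neurons for the current input,
# and the block computes only the top-k selected rows/columns of its weights.
import torch
import torch.nn as nn


class SparseMLPBlock(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, top_k=512):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)   # dense weights, trained as usual
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Low-cost predictor: a small low-rank projection scoring each hidden neuron.
        self.predictor = nn.Sequential(
            nn.Linear(d_model, 128, bias=False),
            nn.Linear(128, d_hidden, bias=False),
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, d_model)
        # 1. Predict which hidden neurons matter for THIS input (contextual sparsity).
        scores = self.predictor(x)                 # (batch, d_hidden)
        idx = scores.topk(self.top_k, dim=-1).indices

        # 2. Gather only the selected slice of the dense weights and run the MLP
        #    on it (per-example loop for clarity; a real kernel would fuse this).
        outs = []
        for b in range(x.size(0)):
            w1 = self.fc1.weight[idx[b]]           # (top_k, d_model)
            b1 = self.fc1.bias[idx[b]]             # (top_k,)
            w2 = self.fc2.weight[:, idx[b]]        # (d_model, top_k)
            h = torch.relu(x[b] @ w1.T + b1)       # (top_k,)
            outs.append(h @ w2.T + self.fc2.bias)  # (d_model,)
        return torch.stack(outs)
```

In this toy version only 512 of 4096 hidden neurons are touched per token, so the two matrix multiplies shrink proportionally; the abstract's wall-clock gains additionally depend on predicting the sparsity asynchronously and implementing the gathered matrix products efficiently on the GPU.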