Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
October 26, 2023
Authors: Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
cs.AI
Abstract
Large language models (LLMs) with hundreds of billions of parameters have
sparked a new wave of exciting AI applications. However, they are
computationally expensive at inference time. Sparsity is a natural approach to
reduce this cost, but existing methods either require costly retraining, have
to forgo LLM's in-context learning ability, or do not yield wall-clock time
speedup on modern hardware. We hypothesize that contextual sparsity, which
consists of small, input-dependent sets of attention heads and MLP parameters
that yield approximately the same output as the dense model for a given input, can address
these issues. We show that contextual sparsity exists, that it can be
accurately predicted, and that we can exploit it to speed up LLM inference in
wall-clock time without compromising LLM's quality or in-context learning
ability. Based on these insights, we propose DejaVu, a system that uses a
low-cost algorithm to predict contextual sparsity on the fly given inputs to
each layer, along with an asynchronous and hardware-aware implementation that
speeds up LLM inference. We validate that DejaVu can reduce the inference
latency of OPT-175B by over 2X compared to the state-of-the-art
FasterTransformer, and over 6X compared to the widely used Hugging Face
implementation, without compromising model quality. The code is available at
https://github.com/FMInference/DejaVu.
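
To make the idea concrete, the sketch below illustrates how a low-cost predictor could select input-dependent MLP neurons so that only a small slice of the dense weights is computed. This is a minimal illustration under assumed shapes and names (the `SparseMLPBlock` module, the low-rank `predictor`, and the per-example gather loop are all hypothetical simplifications), not the DejaVu implementation, which uses an asynchronous, hardware-aware kernel.

```python
# Minimal sketch of contextual sparsity for an MLP block (illustrative only):
# a cheap low-rank predictor scores the hidden neurons for the current input,
# and the block computes only the top-k selected rows/columns of its weights.
import torch
import torch.nn as nn


class SparseMLPBlock(nn.Module):
    def __init__(self, d_model=1024, d_hidden=4096, top_k=512):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)   # dense weights, trained as usual
        self.fc2 = nn.Linear(d_hidden, d_model)
        # Low-cost predictor: a small low-rank projection scoring each hidden neuron.
        self.predictor = nn.Sequential(
            nn.Linear(d_model, 128, bias=False),
            nn.Linear(128, d_hidden, bias=False),
        )
        self.top_k = top_k

    def forward(self, x):                          # x: (batch, d_model)
        # 1. Predict which hidden neurons matter for THIS input (contextual sparsity).
        scores = self.predictor(x)                 # (batch, d_hidden)
        idx = scores.topk(self.top_k, dim=-1).indices

        # 2. Gather only the selected slice of the dense weights and run the MLP
        #    on it (per-example loop for clarity; a real kernel would fuse this).
        outs = []
        for b in range(x.size(0)):
            w1 = self.fc1.weight[idx[b]]           # (top_k, d_model)
            b1 = self.fc1.bias[idx[b]]             # (top_k,)
            w2 = self.fc2.weight[:, idx[b]]        # (d_model, top_k)
            h = torch.relu(x[b] @ w1.T + b1)       # (top_k,)
            outs.append(h @ w2.T + self.fc2.bias)  # (d_model,)
        return torch.stack(outs)
```

In this toy version only 512 of 4096 hidden neurons are touched per token, so the two matrix multiplies shrink proportionally; the abstract's wall-clock gains additionally depend on predicting the sparsity asynchronously and implementing the gathered matrix products efficiently on the GPU.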