

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

June 10, 2024
Authors: Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen
cs.AI

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer
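
The abstract's key technical point is that the choice of FFN activation function determines how many neurons are exactly zero at inference time. Below is a minimal PyTorch sketch, not taken from the paper, contrasting a standard SwiGLU feed-forward block with a dReLU-style variant that applies ReLU to both the gate and up projections; the class names, dimensions, and the precise dReLU formulation are illustrative assumptions, and the released TurboSparse models on the linked Hugging Face page define the authoritative version.

```python
# Illustrative sketch only: contrasts a SwiGLU FFN with a dReLU-style variant.
# The exact dReLU definition, names, and dimensions are assumptions, not the
# paper's official implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUFFN(nn.Module):
    """Common SwiGLU block: (SiLU(x W_gate) * (x W_up)) W_down.
    SiLU is rarely exactly zero, so few hidden neurons can be skipped."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class DReLUFFN(nn.Module):
    """dReLU-style block: ReLU on both branches, so a hidden neuron contributes
    exactly zero whenever either branch is non-positive, which raises the
    fraction of neurons that can be skipped during decoding."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.relu(self.w_gate(x)) * F.relu(self.w_up(x)))


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(4, 128, 1024)  # (batch, seq_len, model dim) -- toy sizes
    ffn = DReLUFFN(dim=1024, hidden_dim=4096)
    # Measure how many hidden activations are exactly zero for random inputs.
    hidden = F.relu(ffn.w_gate(x)) * F.relu(ffn.w_up(x))
    sparsity = (hidden == 0).float().mean().item()
    print(f"fraction of zero hidden activations: {sparsity:.2%}")
```

The sketch only illustrates why zero-valued activations matter: a sparsity-aware inference engine (such as PowerInfer, referenced by the model release) can skip the rows of the down projection that correspond to zeroed neurons, which is the mechanism behind the reported 2-5x decoding speedup.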

