

Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters

June 10, 2024
Authors: Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen
cs.AI

Abstract

Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer
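
The core idea behind dReLU can be illustrated with a small sketch. The snippet below is a minimal illustration, not the authors' implementation: the class `GatedFFN` and its parameter names are assumptions, and the exact formulation in the paper may differ. It contrasts a standard SwiGLU gated FFN, as used in Mistral/Mixtral FFN experts, with a dReLU-style variant that applies ReLU to both the gate and up projections, so the element-wise product is exactly zero whenever either branch is non-positive.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedFFN(nn.Module):
    """Gated feed-forward block (sketch of a Mistral/Mixtral-style FFN expert).

    activation="swiglu": SiLU on the gate branch -> dense activations, almost no exact zeros.
    activation="drelu":  ReLU on both gate and up branches (sketch of the paper's idea) ->
                         the product is exactly zero whenever either branch is <= 0,
                         which yields high activation sparsity.
    """

    def __init__(self, d_model: int, d_ff: int, activation: str = "drelu"):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        self.activation = activation

    def hidden(self, x: torch.Tensor) -> torch.Tensor:
        gate, up = self.gate_proj(x), self.up_proj(x)
        if self.activation == "swiglu":
            return F.silu(gate) * up          # dense: few exact zeros
        return F.relu(gate) * F.relu(up)      # sparse: zero if either side is non-positive

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.hidden(x))


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(4, 512)
    for act in ("swiglu", "drelu"):
        ffn = GatedFFN(d_model=512, d_ff=2048, activation=act)
        h = ffn.hidden(x)
        sparsity = (h == 0).float().mean().item()
        print(f"{act}: fraction of exactly-zero hidden activations = {sparsity:.2%}")
```

Because most hidden units are exactly zero under the dReLU-style variant, an inference engine can predict the inactive neurons and skip the corresponding rows and columns of the up and down projections during decoding; per the abstract, exploiting this sparsity is what yields the reported 2-5x decoding speedup.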
