WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
May 26, 2025
Authors: Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen
cs.AI
Abstract
The growing computational demands of large language models (LLMs) make
efficient inference and activation strategies increasingly critical. While
recent approaches such as Mixture-of-Experts (MoE) improve efficiency through
selective activation, they require specialized training; training-free sparse
activation methods, by contrast, offer broader applicability and superior
resource efficiency through their plug-and-play design. However, many
existing methods rely solely on
hidden state magnitudes to determine activation, resulting in high
approximation errors and suboptimal inference accuracy. To address these
limitations, we propose WINA (Weight Informed Neuron Activation), a novel,
simple, and training-free sparse activation framework that jointly considers
hidden state magnitudes and the column-wise ℓ₂-norms of weight matrices.
We show that this leads to a sparsification strategy that obtains optimal
approximation error bounds with theoretical guarantees tighter than existing
techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g.,
TEAL) by up to 2.94% in average performance at the same sparsity levels,
across a diverse set of LLM architectures and datasets. These results position
WINA as a new performance frontier for training-free sparse activation in LLM
inference, setting a robust baseline for efficient inference. The source code
is available at
https://github.com/microsoft/wina.
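For illustration only, below is a minimal PyTorch sketch of the kind of selection rule the abstract describes: score each input dimension by the product of its hidden-state magnitude and the ℓ₂ norm of the corresponding weight-matrix column, then keep only the top-scoring fraction. The function name `wina_mask`, the `sparsity` parameter, and the exact top-k thresholding are assumptions made for this sketch; refer to the linked repository for the authors' actual implementation.

```python
import torch


def wina_mask(x: torch.Tensor, weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out low-importance input dimensions before a linear layer y = x @ W.T.

    Each input dimension i is scored by |x_i| * ||W[:, i]||_2, i.e. the
    hidden-state magnitude weighted by the L2 norm of the corresponding
    weight-matrix column; only the top-(1 - sparsity) fraction is kept.
    Illustrative sketch only, not the paper's reference implementation.
    """
    # Column-wise L2 norms of the weight matrix: one scale factor per input dim.
    # For torch.nn.Linear, `weight` has shape (out_features, in_features),
    # so reducing over dim=0 yields a vector of length in_features.
    col_norms = weight.norm(dim=0)

    scores = x.abs() * col_norms                     # shape: (..., in_features)
    k = max(1, int(x.shape[-1] * (1.0 - sparsity)))  # number of dims to keep

    mask = torch.zeros_like(x)
    mask.scatter_(-1, scores.topk(k, dim=-1).indices, 1.0)
    return x * mask


# Hypothetical usage: 50% activation sparsity on an MLP up-projection.
proj = torch.nn.Linear(4096, 11008, bias=False)
x = torch.randn(2, 4096)
y = proj(wina_mask(x, proj.weight, sparsity=0.5))
```

Compared with magnitude-only criteria (ranking by |x_i| alone), weighting by the column norm accounts for how strongly each input dimension can influence the layer's output, which is the intuition behind the tighter approximation-error bounds claimed in the abstract.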