
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

May 26, 2025
作者: Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Tianyi Chen
cs.AI

Abstract

The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches such as Mixture-of-Experts (MoE) leverage selective activation, they require specialized training; training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise ℓ₂-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than those of existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to 2.94% in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
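The core criterion described in the abstract — scoring each input dimension by the product of its hidden-state magnitude and the ℓ₂-norm of the corresponding weight column, then keeping only the top-scoring neurons — can be sketched as follows. This is a minimal illustration under assumed conventions (row-major `x @ W` layout, a top-k selection rule, and the helper name `wina_mask` are all assumptions, not the official implementation):

```python
import numpy as np

def wina_mask(x, W, sparsity=0.5):
    """Sketch of a WINA-style activation criterion (hypothetical helper):
    score input dimension i by |x_i| * ||w_i||_2, where w_i is the set of
    weights that multiply x_i, and keep the top-(1 - sparsity) fraction."""
    # With the y = x @ W convention (W has shape [d_in, d_out]), the
    # weights multiplying x_i form row i of W, i.e. column i of W^T.
    col_norms = np.linalg.norm(W, axis=1)
    scores = np.abs(x) * col_norms
    k = int(round((1.0 - sparsity) * x.shape[0]))  # number of neurons to keep
    keep = np.argsort(scores)[-k:]                 # indices of top-k scores
    mask = np.zeros_like(x)
    mask[keep] = 1.0
    return mask

# Usage: zero out low-score activations before the matrix multiply.
x = np.array([0.1, -2.0, 0.5, 0.05])
W = np.array([[1.0, 0.0],
              [0.1, 0.1],
              [2.0, 2.0],
              [5.0, 5.0]])
mask = wina_mask(x, W, sparsity=0.5)
y = (x * mask) @ W
```

Note the contrast with magnitude-only methods: index 1 has the largest |x| (2.0) but tiny weights, so its weight-informed score is low and it is pruned, while the small activation at index 3 survives because its weight column is large.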

