WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference

Abstract

The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. Recent approaches such as Mixture-of-Experts (MoE) leverage selective activation but require specialized training, whereas training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise ℓ₂-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to 2.94% in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference and set a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
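To make the contrast between magnitude-only and weight-informed selection concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the reference implementation from the repository: it assumes the two quantities are combined by multiplying each hidden state magnitude by the ℓ₂-norm of the matching weight column, and that the top-k scores are kept. The function names (`top_k_mask`, `magnitude_mask`, `weight_informed_mask`, `approximation_error`) are hypothetical and chosen only for this example.

```python
import numpy as np


def top_k_mask(scores, k):
    """Boolean mask that is True for the k largest scores."""
    keep = np.argsort(-scores)[:k]
    mask = np.zeros(scores.shape, dtype=bool)
    mask[keep] = True
    return mask


def magnitude_mask(x, k):
    """Baseline: keep the k hidden-state entries with the largest |x_i|."""
    return top_k_mask(np.abs(x), k)


def weight_informed_mask(x, W, k):
    """Assumed scoring rule: |x_i| times the l2-norm of column i of W."""
    col_norms = np.linalg.norm(W, axis=0)  # one norm per input dimension
    return top_k_mask(np.abs(x) * col_norms, k)


def approximation_error(x, W, mask):
    """l2 distance between the dense output W @ x and the sparsified output."""
    return float(np.linalg.norm(W @ x - W @ (x * mask)))


if __name__ == "__main__":
    # Toy layer: the second input has a small activation but a heavy weight column.
    x = np.array([1.0, 0.5, 0.1])
    W = np.diag([1.0, 10.0, 1.0])

    for name, mask in [
        ("magnitude only", magnitude_mask(x, 1)),
        ("weight informed", weight_informed_mask(x, W, 1)),
    ]:
        print(name, mask, approximation_error(x, W, mask))
```

In this toy case the magnitude-only rule keeps the first entry and discards the one that contributes most to the output, while the weight-informed rule keeps the second entry and yields a smaller output error. The example only illustrates the intuition and says nothing about the error bounds claimed in the abstract.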
