Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Abstract

Activation sparsity refers to the presence of many weakly contributing elements in activation outputs that can be eliminated, which benefits many important applications of large language models (LLMs). Although promoting greater activation sparsity in LLMs deserves in-depth study, existing work lacks comprehensive, quantitative research on how activation sparsity correlates with potentially influential factors. In this paper, we present a comprehensive study of the quantitative scaling properties and influential factors of activation sparsity in decoder-only Transformer-based LLMs. Specifically, we propose PPL-p% sparsity, a precise, performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. First, different activation functions exhibit comparable performance but opposite training-time sparsity trends: with the amount of training data, the activation ratio (i.e., 1 - sparsity ratio) evolves as a convergent increasing power law for SiLU-activated LLMs and as a decreasing logspace power law for ReLU-activated LLMs. These trends indicate that ReLU is more efficient than SiLU as an activation function and can leverage more training data to improve activation sparsity. Second, below a certain bottleneck point, the activation ratio increases linearly with the width-depth ratio, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies only weakly with the parameter scale; that is, the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws for LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
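As a rough illustration of the quantity discussed above, the sketch below computes an activation ratio (1 - sparsity ratio) for a small vector of pre-activations under ReLU and SiLU. It is not the PPL-p% metric itself: the abstract describes that metric as performance-aware, whereas this sketch uses a fixed magnitude threshold purely as a stand-in. The function name `activation_ratio`, the `threshold` parameter, and the sample values are all assumptions made for illustration.

```python
import math


def relu(x: float) -> float:
    """ReLU: negative inputs become exactly zero."""
    return max(0.0, x)


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x); small but nonzero for negative inputs."""
    return x / (1.0 + math.exp(-x))


def activation_ratio(activations, threshold=0.0):
    """Fraction of entries whose magnitude exceeds `threshold`.

    The sparsity ratio is 1 minus this value. The fixed threshold is a
    hypothetical stand-in, not the performance-aware PPL-p% criterion.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if abs(a) > threshold)
    return active / len(activations)


# Hypothetical pre-activation values for one token in one feed-forward layer.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_out = [relu(x) for x in pre_activations]
silu_out = [silu(x) for x in pre_activations]

relu_ratio = activation_ratio(relu_out)                     # exact zeros count as inactive
silu_ratio = activation_ratio(silu_out)                     # almost nothing is exactly zero
silu_ratio_thresholded = activation_ratio(silu_out, 0.25)   # small magnitudes count as inactive

print(relu_ratio, silu_ratio, silu_ratio_thresholded)
```

With a zero threshold, SiLU outputs look almost fully dense because they are rarely exactly zero, which is one reason a metric that works for any activation function needs some notion of "weakly contributing" rather than "exactly zero".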

