Measuring Maximum Activations in Open Large Language Models

Abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the post-LLaMA open-model boom. We ask a deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching roughly 7 × 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage rather than a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
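
To illustrate why a large activation maximum can matter for INT-8 reconstruction error through the choice of activation scale, the following is a minimal sketch, not the authors' sanity check. It assumes per-tensor symmetric quantization with an absmax scale (scale = max|x| / 127) and uses synthetic Gaussian values as a stand-in for typical activations; the function names `int8_absmax_roundtrip`, `mean_squared_error`, and `bulk_error_with_peak` are hypothetical and do not come from the released code.

```python
import random


def int8_absmax_roundtrip(values):
    """Quantize to symmetric INT8 with an absmax scale, then dequantize.

    Assumption: one scale per tensor, scale = max|x| / 127.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)  # nothing to quantize
    scale = max_abs / 127.0
    recon = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # clamp to the INT8 grid
        recon.append(q * scale)
    return recon


def mean_squared_error(a, b):
    """Mean squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def bulk_error_with_peak(peak, n=4096, seed=0):
    """Reconstruction error on unit-variance values when one peak sets the scale.

    The synthetic Gaussian bulk is an illustrative assumption, not measured data.
    """
    rng = random.Random(seed)
    bulk = [rng.gauss(0.0, 1.0) for _ in range(n)]
    recon = int8_absmax_roundtrip(bulk + [peak])
    return mean_squared_error(bulk, recon[:n])  # error on the bulk only


if __name__ == "__main__":
    # Peaks chosen to echo the magnitudes quoted in the abstract.
    for peak in (1e1, 1e3, 7e5):
        print(f"peak={peak:g}  bulk MSE={bulk_error_with_peak(peak):.4g}")
```

Under these assumptions, a single large value stretches the quantization grid, so ordinary-magnitude values collapse onto fewer levels and their reconstruction error grows with the peak. Finer-grained scales (per-channel or per-token) would change the numbers, which is one reason the measured maximum is worth knowing before picking a scheme.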