Measuring Maximum Activations in Open Large Language Models

Abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the open models released since. We ask a deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Using a unified pipeline (a 5,000-sample multi-domain corpus, family-specific tokenization, and identical hooks on embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and the final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching about 7 × 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit peaks 14.0x to 23.4x lower than those of matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT8 sanity check shows that measured maxima co-vary with low-bit reconstruction error, because the maximum determines the activation scale. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
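
The link between a measured maximum and low-bit reconstruction error can be illustrated with a minimal sketch. It assumes symmetric per-tensor absmax INT8 quantization, which may differ from the scheme used in the paper's sanity check, and it runs on synthetic Gaussian values rather than real activations. The function names, the bulk size, and the peak values are all hypothetical choices made for illustration.

```python
import random


def absmax_int8_roundtrip(values):
    """Quantize to symmetric INT8 with one absmax scale, then dequantize.

    Assumption: per-tensor absmax scaling; not necessarily the paper's scheme.
    """
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / 127.0  # a single outlier sets the step size for every value
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def bulk_error_with_outlier(peak, n=1024, seed=0):
    """Mean absolute reconstruction error on the bulk when one outlier is present."""
    rng = random.Random(seed)
    bulk = [rng.gauss(0.0, 1.0) for _ in range(n)]  # synthetic stand-in values
    restored = absmax_int8_roundtrip(bulk + [peak])
    # zip stops at the end of `bulk`, so the outlier itself is excluded.
    return sum(abs(a - b) for a, b in zip(bulk, restored)) / n


# Hypothetical peak magnitudes, relative to a unit-variance bulk.
errors = {peak: bulk_error_with_outlier(peak) for peak in (10.0, 100.0, 1000.0)}
for peak, err in errors.items():
    print(f"peak={peak:g}  mean abs error on bulk={err:.4f}")
```

In this toy setting the outlier is always recovered almost exactly, while the error on the remaining values grows with the peak, because the quantization step is proportional to the maximum. Once the step exceeds the spread of the bulk, nearly every ordinary value rounds to zero.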