Measuring Maximum Activations in Open Large Language Models
May 15, 2026
Authors: Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin
cs.AI
Abstract
The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (a 5,000-sample multi-domain corpus, family-specific tokenization, and identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and the final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 × 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit peaks 14.0 to 23.4× lower than those of matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
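To make the link between a measured maximum and low-bit reconstruction error concrete, the following is a minimal sketch and not the authors' INT-8 check. It assumes symmetric per-tensor absmax quantization (scale = absmax / 127) and a synthetic tensor of unit-variance Gaussian values plus one injected peak. The function names, the peak values, and the synthetic data are all illustrative assumptions; only the 10^2 to ~7 × 10^5 range is taken from the abstract.

```python
import random


def quantize_int8_absmax(values, absmax):
    """Symmetric per-tensor INT8 round trip: quantize with scale = absmax / 127,
    clamp to [-127, 127], then dequantize back to floats."""
    scale = absmax / 127.0
    dequantized = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        dequantized.append(q * scale)
    return dequantized


def reconstruction_error(peak, n=4096, seed=0):
    """Mean absolute error on the 'bulk' activations when a single value of
    magnitude `peak` sets the quantization scale (synthetic data, assumption)."""
    rng = random.Random(seed)
    bulk = [rng.gauss(0.0, 1.0) for _ in range(n)]  # typical activations
    acts = bulk + [peak]                            # one massive activation
    absmax = max(abs(v) for v in acts)
    deq = quantize_int8_absmax(acts, absmax)
    # Score only the bulk entries; the peak itself is reconstructed exactly.
    return sum(abs(x - y) for x, y in zip(bulk, deq[:n])) / n


# Peaks chosen to mirror the range quoted in the abstract (illustrative only).
errors = {peak: reconstruction_error(peak) for peak in (1e2, 1e3, 7e5)}

if __name__ == "__main__":
    for peak, err in errors.items():
        print(f"peak={peak:.0e}  step={peak / 127:.3g}  bulk MAE={err:.4f}")
```

Under these assumptions, the quantization step grows linearly with the peak. Once the step exceeds roughly twice the largest bulk value, every ordinary activation rounds to zero and the bulk error saturates at the mean absolute value of the data. This is one way a larger measured maximum can translate into larger reconstruction error through scale selection alone.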