Measuring Maximum Activations in Open-Source Large Language Models

Abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching approximately 7 × 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4x lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
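To make the link between maximum activation and low-bit reconstruction error concrete, the following minimal sketch applies symmetric per-tensor absmax INT8 quantization to synthetic data. It is an illustration under stated assumptions, not the sanity check used in this work: the helper names (`quantize_int8_absmax`, `mean_squared_error`), the Gaussian stand-in activations, and the peak values are all hypothetical choices made for the example.

```python
import random


def quantize_int8_absmax(values):
    """Quantize and dequantize with one symmetric scale = max|x| / 127.

    This is a generic absmax scheme chosen for illustration; it is an
    assumption, not necessarily the scheme used in the paper.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


def mean_squared_error(reference, reconstructed):
    """Mean squared difference between two equal-length sequences."""
    return sum((r - q) ** 2 for r, q in zip(reference, reconstructed)) / len(reference)


random.seed(0)
# Synthetic unit-variance "activations"; these are NOT measurements from any model.
bulk = [random.gauss(0.0, 1.0) for _ in range(4096)]

results = {}
for peak in (1e1, 1e2, 1e3):  # hypothetical single outlier magnitudes
    activations = bulk + [peak]
    reconstructed, scale = quantize_int8_absmax(activations)
    # Score only the ordinary entries: the outlier itself is represented exactly,
    # but it stretches the scale and coarsens the grid for everything else.
    error = mean_squared_error(bulk, reconstructed[:-1])
    results[peak] = (scale, error)
    print(f"peak={peak:g}  scale={scale:.4g}  bulk MSE={error:.4g}")
```

Because the scale is set by the single largest magnitude, a larger peak widens the quantization step for every other entry, so the error on the bulk of the tensor grows with the peak. Per-channel or per-token scales, or clipping the scale below the absolute maximum, would change this trade-off.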