Measuring Maximum Activations in Open Large Language Models

May 15, 2026
Authors: Luxuan Chen, Han Tian, Xinran Chen, Rui Kong, Fang Wang, Jiamin Chen, Yuchen Li, Jiashu Zhao, Shuaiqiang Wang, Haoyi Xiong, Dawei Yin
cs.AI

Abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations in pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (a 5,000-sample multi-domain corpus, family-specific tokenization, and identical hooks on embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and the final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 x 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit peaks 14.0x to 23.4x lower than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT-8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
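To make the link between maximum activation and low-bit reconstruction error concrete, the following is a minimal sketch, not the authors' sanity check. It assumes symmetric per-tensor INT8 quantization with an absmax scale (scale = max|x| / 127), which the abstract does not specify, and it uses synthetic Gaussian values plus one injected outlier rather than real model activations. All function names here are hypothetical.

```python
import random


def quantize_int8_absmax(values):
    """Symmetric per-tensor INT8 quantization (assumed scheme): scale = max|x| / 127."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to floating-point values."""
    return [qi * scale for qi in q]


def mean_abs_error(values, restored):
    """Mean absolute reconstruction error between two equal-length lists."""
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


def bulk_error_with_peak(bulk, peak):
    """Append one synthetic outlier of magnitude `peak`, quantize everything
    with a single shared scale, and report the error on the bulk values only."""
    q, scale = quantize_int8_absmax(bulk + [peak])
    restored = dequantize(q, scale)
    return mean_abs_error(bulk, restored[: len(bulk)])


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for ordinary activations: unit-variance Gaussian noise.
    bulk = [rng.gauss(0.0, 1.0) for _ in range(4096)]
    # Peak magnitudes chosen to echo the ranges quoted in the abstract.
    for peak in (1e1, 1e3, 7e5):
        err = bulk_error_with_peak(bulk, peak)
        print(f"peak={peak:.0e}  bulk MAE={err:.4f}")
```

Because the single largest value sets the scale, a larger peak coarsens the quantization grid for every other value. Once the step size exceeds the spread of the bulk, the ordinary values all round to zero and their reconstruction error saturates at their own mean magnitude.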