Measuring Maximum Activations in Open Large Language Models

Abstract

The dynamic range of activations is a first-order constraint for low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations on pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the post-LLaMA open-model boom. We ask the deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Under a unified pipeline (5,000-sample multi-domain corpus, family-specific tokenization, identical hooks across embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10² to 10³ range and Gemma3-27B-it reaching ~7 × 10⁵; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0-23.4× lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT8 sanity check shows that measured maxima co-vary with low-bit reconstruction error via activation-scale selection. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
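
The link between a tensor's maximum activation and its low-bit reconstruction error can be illustrated with a toy example. The sketch below is not the paper's sanity-check procedure: it assumes symmetric per-tensor absmax INT8 quantization and synthetic Gaussian "activations" with one injected outlier, and the function names are illustrative. Because the quantization step equals the maximum magnitude divided by 127, a larger peak coarsens the grid for every other entry.

```python
import random


def int8_absmax_roundtrip(values):
    """Quantize to symmetric INT8 with a per-tensor absmax scale, then dequantize.

    Assumption: one scale for the whole tensor, chosen from its maximum magnitude.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / 127.0  # size of one INT8 step in real units
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_squared_error(reference, approx):
    """Average squared difference between two equal-length sequences."""
    return sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)


rng = random.Random(0)
# Synthetic unit-variance values standing in for activations (not paper data).
base = [rng.gauss(0.0, 1.0) for _ in range(4096)]

errors = {}
for peak in (10.0, 100.0, 1000.0):
    acts = list(base)
    acts[0] = peak  # inject a single large activation that sets the scale
    recon = int8_absmax_roundtrip(acts)
    # Measure the error on the ordinary entries only, excluding the outlier.
    errors[peak] = mean_squared_error(acts[1:], recon[1:])
    print(f"peak={peak:g}  step={peak / 127.0:.4g}  mse={errors[peak]:.4g}")
```

In this toy setting the error on the ordinary entries grows with the injected peak, which is the mechanism the abstract refers to as activation-scale selection. Finer-grained scales (per-channel or per-token) would change the numbers, so the sketch should be read as qualitative only.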