Measuring Maximum Activations in Open Large Language Models

Abstract

The dynamic range of activations is a first-order constraint on low-bit quantization, activation scaling, and stable LLM inference. Prior work characterized outlier features and massive activations in pre-2024 LLaMA-style models, and the downstream activation-quantization stack has inherited that picture without revisiting it for the post-LLaMA open-model boom. We ask a deployment-oriented question: how large can activations get in modern open LLMs, and how does this magnitude vary across families, generations, and training stages? Using a unified pipeline (a 5,000-sample multi-domain corpus, family-specific tokenization, and identical hooks on embeddings, hidden states, attention, MLP/MoE, SwiGLU gates, and the final norm), we measure global and layerwise maxima on 27 checkpoints from 8 open families spanning dense, MoE, vision-language, intermediate-training, and instruction-tuned variants. We find that (i) global maxima span nearly four orders of magnitude at comparable parameter counts, with Qwen3.5 and MoE checkpoints in the 10^2 to 10^3 range and Gemma3-27B-it reaching ~7 × 10^5; (ii) cross-family and cross-generation comparisons break simple monotonic scaling; and (iii) MoE checkpoints exhibit 14.0–23.4× lower peaks than matched-scale dense counterparts, while the residual stream carries the global maximum in 22/24 checkpoints. A lightweight INT8 sanity check shows that measured maxima co-vary with low-bit reconstruction error, because the maximum determines the activation scale. We conclude that maximum activation magnitude is a model property tied to family, architecture, and training stage, not a simple byproduct of size, and that it should be measured and reported alongside any open-weight release before low-bit deployment. The code is publicly available at https://github.com/clx1415926/Max_act_llm.
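To make the link between a measured maximum and low-bit reconstruction error concrete, the following is a minimal sketch, not the paper's actual sanity check. It assumes symmetric per-tensor absmax INT8 quantization and a synthetic activation vector (unit-variance Gaussian bulk plus a single injected peak); the function names, the peak values, and the choice of quantizer are all illustrative assumptions.

```python
import random


def absmax_int8_roundtrip(values):
    """Quantize to symmetric INT8 with one absmax scale, then dequantize.

    Assumption: per-tensor absmax scaling; the paper's check may differ.
    """
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return [q * scale for q in quantized], scale


def mean_squared_error(original, reconstructed):
    """Average squared difference between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


# Synthetic "bulk" activations: unit-variance Gaussian (hypothetical data).
rng = random.Random(0)
bulk = [rng.gauss(0.0, 1.0) for _ in range(4096)]

results = {}
for peak in (1e1, 1e2, 1e3):
    # One large activation sets the scale for the whole tensor.
    activations = bulk + [peak]
    reconstructed, scale = absmax_int8_roundtrip(activations)
    # Measure error on the bulk only, excluding the injected peak.
    bulk_mse = mean_squared_error(bulk, reconstructed[:-1])
    results[peak] = (scale, bulk_mse)
    print(f"peak={peak:g}  scale={scale:.4g}  bulk MSE={bulk_mse:.4g}")
```

Under these assumptions, the quantization step grows linearly with the peak, so the error on the ordinary activations rises as the peak grows until most of the bulk rounds to zero. This is the mechanism by which a single large activation can dominate the error of a per-tensor scheme.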