A Refined Analysis of Massive Activations in LLMs

Abstract

Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e., suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular, pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at https://github.com/bluorion-com/refine_massive_activations.
