A Refined Analysis of Massive Activations in LLMs
March 28, 2025
Authors: Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra
cs.AI
Abstract
Motivated in part by their relevance for low-precision training and
quantization, massive activations in large language models (LLMs) have recently
emerged as a topic of interest. However, existing analyses are limited in
scope, and generalizability across architectures is unclear. This paper helps
address some of these gaps by conducting an analysis of massive activations
across a broad range of LLMs, including both GLU-based and non-GLU-based
architectures. Our findings challenge several prior assumptions, most
importantly: (1) not all massive activations are detrimental, i.e. suppressing
them does not lead to an explosion of perplexity or a collapse in downstream
task performance; (2) proposed mitigation strategies such as Attention KV bias
are model-specific and ineffective in certain cases. We consequently
investigate novel hybrid mitigation strategies; in particular pairing Target
Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT)
successfully balances the mitigation of massive activations with preserved
downstream model performance in the scenarios we investigated. Our code is
available at: https://github.com/bluorion-com/refine_massive_activations.
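To make the phenomenon the abstract studies concrete, a minimal sketch of how massive activations are commonly identified in a hidden-state tensor. The thresholds below (an absolute magnitude cutoff and a large multiple of the median magnitude) follow the characterization popularized in prior work on massive activations; the specific values and the function name are illustrative assumptions, not this paper's exact criterion.

```python
import numpy as np

def find_massive_activations(hidden, abs_thresh=100.0, ratio_thresh=1000.0):
    """Return indices of "massive" activations in a hidden-state array.

    Heuristic: an activation counts as massive if its magnitude exceeds
    abs_thresh AND is more than ratio_thresh times the median magnitude
    of all activations in the tensor. Both thresholds are illustrative.
    """
    mags = np.abs(hidden)
    median = np.median(mags)
    mask = (mags > abs_thresh) & (mags > ratio_thresh * median)
    return np.argwhere(mask)

# Example: a (seq_len, hidden_dim) activation map with one huge outlier.
hidden = np.full((4, 8), 0.1)
hidden[2, 5] = 500.0
print(find_massive_activations(hidden))  # -> [[2 5]]
```

In practice such statistics would be collected per layer (e.g. via forward hooks during inference); suppressing the flagged activations and re-measuring perplexity is the kind of intervention the abstract refers to when it says not all massive activations are detrimental.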