A Refined Analysis of Massive Activations in LLMs
March 28, 2025
Authors: Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra
cs.AI
Abstract
Motivated in part by their relevance for low-precision training and
quantization, massive activations in large language models (LLMs) have recently
emerged as a topic of interest. However, existing analyses are limited in
scope, and generalizability across architectures is unclear. This paper helps
address some of these gaps by conducting an analysis of massive activations
across a broad range of LLMs, including both GLU-based and non-GLU-based
architectures. Our findings challenge several prior assumptions, most
importantly: (1) not all massive activations are detrimental, i.e. suppressing
them does not lead to an explosion of perplexity or a collapse in downstream
task performance; (2) proposed mitigation strategies such as Attention KV bias
are model-specific and ineffective in certain cases. We consequently
investigate novel hybrid mitigation strategies; in particular pairing Target
Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT)
successfully balances the mitigation of massive activations with preserved
downstream model performance in the scenarios we investigated. Our code is
available at: https://github.com/bluorion-com/refine_massive_activations.
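To make the phenomenon the abstract studies concrete, a minimal sketch of how massive activations are commonly identified in a hidden-state tensor. The thresholds below (an absolute magnitude cutoff and a large multiple of the median magnitude) follow the characterization popularized in prior work on massive activations; the specific values and the function name are illustrative assumptions, not this paper's exact criterion.

```python
import numpy as np

def find_massive_activations(hidden, abs_thresh=100.0, ratio_thresh=1000.0):
    """Return indices of "massive" activations in a hidden-state array.

    Heuristic: an activation counts as massive if its magnitude exceeds
    abs_thresh AND is more than ratio_thresh times the median magnitude
    of all activations in the tensor. Both thresholds are illustrative.
    """
    mags = np.abs(hidden)
    median = np.median(mags)
    mask = (mags > abs_thresh) & (mags > ratio_thresh * median)
    return np.argwhere(mask)

# Example: a (seq_len, hidden_dim) activation map with one huge outlier.
hidden = np.full((4, 8), 0.1)
hidden[2, 5] = 500.0
print(find_massive_activations(hidden))  # -> [[2 5]]
```

In practice such statistics would be collected per layer (e.g. via forward hooks during inference); suppressing the flagged activations and re-measuring perplexity is the kind of intervention the abstract refers to when it says not all massive activations are detrimental.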