Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

June 16, 2025
Authors: Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang
cs.AI

Abstract

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models (LLMs). Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 1.5 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees, thereby advancing the development of more transparent and trustworthy AI systems through enhanced mechanistic interpretability.
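
For readers unfamiliar with the setup, the sketch below illustrates one plausible reading of "bias adaptation" in a ReLU sparse autoencoder: encoder biases are nudged so that each latent activates at a chosen target frequency. Every specific here (the target rate, the plain-SGD reconstruction updates, the synthetic data, all variable names) is an illustrative assumption, not the paper's actual algorithm or its GBA variant.

```python
# Minimal sketch (assumptions, not the paper's method): a ReLU sparse
# autoencoder whose encoder biases are adapted toward a target firing rate.
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 64, 256, 4096          # input dim, latent dim, number of samples
X = rng.standard_normal((n, d))  # stand-in for LLM activations

W_enc = rng.standard_normal((d, m)) / np.sqrt(d)
W_dec = rng.standard_normal((m, d)) / np.sqrt(m)
b_enc = np.zeros(m)

lr, adapt_lr, target_rate = 1e-3, 1e-2, 0.05  # hypothetical hyperparameters

for step in range(200):
    batch = X[rng.choice(n, 128, replace=False)]
    pre = batch @ W_enc + b_enc       # encoder pre-activations
    z = np.maximum(pre, 0.0)          # ReLU latents
    x_hat = z @ W_dec                 # reconstruction
    err = x_hat - batch

    # Plain SGD on the mean-squared reconstruction loss (for brevity).
    grad_W_dec = z.T @ err / len(batch)
    grad_z = (err @ W_dec.T) * (pre > 0)
    grad_W_enc = batch.T @ grad_z / len(batch)
    W_dec -= lr * grad_W_dec
    W_enc -= lr * grad_W_enc

    # "Bias adaptation" (illustrative): lower the bias of latents that fire
    # too often and raise it for latents that fire too rarely, steering each
    # latent toward the target activation frequency.
    fire_rate = (z > 0).mean(axis=0)
    b_enc -= adapt_lr * (fire_rate - target_rate)
```

The intent, as the abstract describes it, is to ensure appropriate activation sparsity; the frequency-targeting rule above is just one simple way to realize that idea, rather than the grouped scheme the paper's GBA variant uses.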