Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders
June 16, 2025
Authors: Siyu Chen, Heejune Sheen, Xuyuan Xiong, Tianhao Wang, Zhuoran Yang
cs.AI
Abstract
We study the challenge of achieving theoretically grounded feature recovery
using Sparse Autoencoders (SAEs) for the interpretation of Large Language
Models. Existing SAE training algorithms often lack rigorous mathematical
guarantees and suffer from practical limitations such as hyperparameter
sensitivity and instability. To address these issues, we first propose a novel
statistical framework for the feature recovery problem, which includes a new
notion of feature identifiability by modeling polysemantic features as sparse
mixtures of underlying monosemantic concepts. Building on this framework, we
introduce a new SAE training algorithm based on "bias adaptation", a
technique that adaptively adjusts neural network bias parameters to ensure
appropriate activation sparsity. We theoretically prove that this
algorithm correctly recovers all monosemantic features when input data is
sampled from our proposed statistical model. Furthermore, we develop an
improved empirical variant, Group Bias Adaptation (GBA), and
demonstrate its superior performance against benchmark methods when
applied to LLMs with up to 1.5 billion parameters. This work represents a
foundational step in demystifying SAE training by providing the first SAE
algorithm with theoretical recovery guarantees, thereby advancing the
development of more transparent and trustworthy AI systems through enhanced
mechanistic interpretability.
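The abstract's two core ideas can be illustrated with a minimal sketch: data generated as sparse mixtures of monosemantic concept directions, and an SAE encoder whose per-neuron biases are adaptively adjusted to hit a target activation sparsity. Note this is a toy illustration, not the paper's algorithm: the dimensions, the proportional bias-update rule, the learning rate, and the target frequency are all assumptions made here for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions: input dim, dictionary size, number of samples.
d, m, n = 32, 64, 512
k = 4                        # assumed number of active concepts per sample

# Synthetic data mirroring the paper's modeling assumption: each input is a
# sparse mixture of a few underlying monosemantic concept directions.
concepts = rng.standard_normal((m, d))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)
codes = np.zeros((n, m))
for i in range(n):
    idx = rng.choice(m, size=k, replace=False)
    codes[i, idx] = rng.uniform(0.5, 1.5, size=k)
X = codes @ concepts

# One-layer ReLU encoder with a per-neuron bias. "Bias adaptation" is
# sketched here as proportional control: lower the bias of neurons that
# fire too often, raise it for neurons that fire too rarely, so each
# neuron's activation frequency drifts toward a target sparsity level.
W = rng.standard_normal((m, d)) * 0.1
b = np.zeros(m)
target_freq = k / m          # assumed target activation frequency

for _ in range(300):
    act = np.maximum(X @ W.T + b, 0.0)    # encoder activations, (n, m)
    freq = (act > 0).mean(axis=0)         # empirical firing frequency
    b += 0.1 * (target_freq - freq)       # adaptive bias update (assumed rule)

final_freq = (np.maximum(X @ W.T + b, 0.0) > 0).mean(axis=0)
print(round(final_freq.mean(), 3))
```

With zero initial biases the neurons fire on roughly half the inputs; the feedback loop drives the mean firing frequency down toward the target, showing how bias adjustment alone can enforce activation sparsity without tuning an L1 penalty.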