Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

Abstract

We study the problem of theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models (LLMs). Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability obtained by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically prove that this algorithm correctly recovers all monosemantic features when the input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and demonstrate its superior performance against benchmark methods when applied to LLMs with up to 1.5 billion parameters. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees, thereby advancing the development of more transparent and trustworthy AI systems through enhanced mechanistic interpretability.
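To make the idea of bias adaptation concrete, the following is a minimal NumPy sketch, not the algorithm analyzed in the paper. It assumes a simple sign-based rule that nudges each neuron's bias until its firing frequency approaches a chosen target, and it draws toy inputs as sparse non-negative mixtures of unit-norm concept vectors. The function name `adapt_biases`, the parameters `target_freq` and `step`, and all sizes are hypothetical choices made for illustration only.

```python
import numpy as np

def adapt_biases(pre_acts, biases, target_freq, step=0.01):
    """One illustrative bias-adaptation step (assumed rule, not the paper's update).

    pre_acts: (batch, n_neurons) pre-activations computed without the bias.
    A ReLU neuron counts as active when pre_act + bias > 0.
    """
    freq = np.mean(pre_acts + biases > 0, axis=0)
    # Lower the bias of neurons that fire too often, raise it for those that fire too rarely.
    new_biases = biases - step * np.sign(freq - target_freq)
    return new_biases, freq

rng = np.random.default_rng(0)
n_concepts, dim, n_neurons, batch = 32, 16, 64, 4096

# Toy data: each input is a sparse, non-negative mixture of unit-norm concept vectors.
V = rng.normal(size=(n_concepts, dim))
V /= np.linalg.norm(V, axis=1, keepdims=True)
H = (rng.random((batch, n_concepts)) < 0.1) * rng.random((batch, n_concepts))
X = H @ V

# Fixed random encoder directions; only the biases are adapted in this sketch.
W = rng.normal(size=(dim, n_neurons))
W /= np.linalg.norm(W, axis=0, keepdims=True)
pre = X @ W

b = np.zeros(n_neurons)
for _ in range(300):
    b, freq = adapt_biases(pre, b, target_freq=0.05)

print("mean firing frequency:", freq.mean())
```

In this sketch the encoder weights stay fixed so that the effect of the bias update on activation sparsity is isolated; a full SAE training loop would also update the weights against a reconstruction loss.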
