Learning a Generative Meta-Model of LLM Activations

Abstract

Existing approaches for analyzing neural network activations, such as PCA and sparse autoencoders, rely on strong structural assumptions. Generative models offer an alternative: they can uncover structure without such assumptions and act as priors that improve intervention fidelity. We explore this direction by training diffusion models on one billion residual stream activations, creating "meta-models" that learn the distribution of a network's internal states. We find that diffusion loss decreases smoothly with compute and reliably predicts downstream utility. In particular, applying the meta-model's learned prior to steering interventions improves fluency, with larger gains as loss decreases. Moreover, the meta-model's neurons increasingly isolate concepts into individual units, with sparse probing scores that scale as loss decreases. These results suggest generative meta-models offer a scalable path toward interpretability without restrictive structural assumptions. Project page: https://generative-latent-prior.github.io.

