Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Abstract

A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2, and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
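
To make the decomposition concrete, the following is a minimal NumPy sketch of semi-NMF, not the authors' implementation. It assumes an activation matrix A of shape (neurons, tokens) and factors it as A ≈ Z Y with Y nonnegative and Z unconstrained, using the standard multiplicative updates of Ding, Li, and Jordan (2010). The function name `semi_nmf`, its parameters, and the synthetic data are illustrative assumptions, and the neuron-sparsity constraint described in (a) is omitted here.

```python
import numpy as np


def semi_nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Factor A (neurons x tokens) as Z @ Y with Y >= 0 and Z unconstrained.

    Hypothetical helper for illustration only; the paper's optimizer and
    sparsity handling may differ.
    """
    rng = np.random.default_rng(seed)
    _, n = A.shape
    Y = rng.random((k, n)) + 0.1  # strictly positive initial coefficients

    def pos(M):
        return (np.abs(M) + M) / 2  # positive part of M

    def neg(M):
        return (np.abs(M) - M) / 2  # magnitude of the negative part of M

    for _ in range(n_iter):
        # Z step: unconstrained least squares given the current Y.
        Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
        # Y step: multiplicative update, which keeps Y nonnegative.
        ZtA = Z.T @ A
        ZtZ = Z.T @ Z
        num = pos(ZtA) + neg(ZtZ) @ Y
        den = neg(ZtA) + pos(ZtZ) @ Y + eps
        Y *= np.sqrt(num / den)

    # Refit Z once more so it is optimal for the returned Y.
    Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
    return Z, Y


if __name__ == "__main__":
    # Synthetic stand-in for MLP activations (assumption: 64 neurons, 500 tokens).
    rng = np.random.default_rng(1)
    A = rng.standard_normal((64, 8)) @ rng.random((8, 500))
    Z, Y = semi_nmf(A, k=8)
    rel_err = np.linalg.norm(A - Z @ Y) / np.linalg.norm(A)
    print(f"relative reconstruction error: {rel_err:.3f}")
    # Tokens with the largest coefficient for feature 0, i.e. its top activating inputs.
    print("top tokens for feature 0:", np.argsort(Y[0])[::-1][:5])
```

In this sketch each column of Z is a feature expressed as a combination of neurons, and each row of Y records how strongly that feature is active on every token, which is what allows a feature to be traced back to the inputs that activate it.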
