Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
June 12, 2025
Authors: Or Shafran, Atticus Geiger, Mor Geva
cs.AI
Abstract
A central goal for mechanistic interpretability has been to identify the
right units of analysis in large language models (LLMs) that causally explain
their outputs. While early work focused on individual neurons, evidence that
neurons often encode multiple concepts has motivated a shift toward analyzing
directions in activation space. A key question is how to find directions that
capture interpretable features in an unsupervised manner. Current methods rely
on dictionary learning with sparse autoencoders (SAEs), commonly trained over
residual stream activations to learn directions from scratch. However, SAEs
often struggle in causal evaluations and lack intrinsic interpretability, as
their learning is not explicitly tied to the computations of the model. Here,
we tackle these limitations by directly decomposing MLP activations with
semi-nonnegative matrix factorization (SNMF), such that the learned features
are (a) sparse linear combinations of co-activated neurons, and (b) mapped to
their activating inputs, making them directly interpretable. Experiments on
Llama 3.1, Gemma 2, and GPT-2 show that SNMF-derived features outperform SAEs
and a strong supervised baseline (difference-in-means) on causal steering,
while aligning with human-interpretable concepts. Further analysis reveals that
specific neuron combinations are reused across semantically related features,
exposing a hierarchical structure in the MLP's activation space. Together,
these results position SNMF as a simple and effective tool for identifying
interpretable features and dissecting concept representations in LLMs.
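
For readers unfamiliar with the factorization named above, the following is a minimal NumPy sketch of generic semi-nonnegative matrix factorization, using the classic alternating updates of Ding, Li & Jordan (2010). The token-by-neuron layout of X, the function name semi_nmf, the rank k, and the choice to place the nonnegativity constraint on the neuron-side factor G are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def pos(M):
    # Elementwise positive part: (|M| + M) / 2
    return (np.abs(M) + M) / 2.0

def neg(M):
    # Elementwise negative part: (|M| - M) / 2
    return (np.abs(M) - M) / 2.0

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor X (n_tokens x n_neurons) as X ~= F @ G.T with G >= 0.

    F : (n_tokens, k)  unconstrained per-token coefficients
    G : (n_neurons, k) nonnegative neuron loadings (one column per feature)
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    G = rng.random((m, k))  # nonnegative initialization
    for _ in range(n_iter):
        # F-update: unconstrained least squares given the current G
        F = X @ G @ np.linalg.pinv(G.T @ G)
        # G-update: multiplicative rule that preserves nonnegativity
        A = X.T @ F          # (m, k)
        B = F.T @ F          # (k, k)
        numer = pos(A) + G @ neg(B)
        denom = neg(A) + G @ pos(B) + eps
        G *= np.sqrt(numer / denom)
    return F, G

# Toy usage on random stand-in "MLP activations" (tokens x neurons)
X = np.random.default_rng(1).standard_normal((512, 3072))
F, G = semi_nmf(X, k=32)
print(G.min() >= 0, np.linalg.norm(X - F @ G.T) / np.linalg.norm(X))

Because only G is constrained to be nonnegative while F and X keep mixed signs, each column of G reads as a purely additive weighting over neurons, which is the sense in which such factors can be interpreted as groups of co-activated neurons.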