Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization
June 12, 2025
Authors: Or Shafran, Atticus Geiger, Mor Geva
cs.AI
Abstract
A central goal for mechanistic interpretability has been to identify the
right units of analysis in large language models (LLMs) that causally explain
their outputs. While early work focused on individual neurons, evidence that
neurons often encode multiple concepts has motivated a shift toward analyzing
directions in activation space. A key question is how to find directions that
capture interpretable features in an unsupervised manner. Current methods rely
on dictionary learning with sparse autoencoders (SAEs), commonly trained over
residual stream activations to learn directions from scratch. However, SAEs
often struggle in causal evaluations and lack intrinsic interpretability, as
their learning is not explicitly tied to the computations of the model. Here,
we tackle these limitations by directly decomposing MLP activations with
semi-nonnegative matrix factorization (SNMF), such that the learned features
are (a) sparse linear combinations of co-activated neurons, and (b) mapped to
their activating inputs, making them directly interpretable. Experiments on
Llama 3.1, Gemma 2, and GPT-2 show that SNMF-derived features outperform SAEs
and a strong supervised baseline (difference-in-means) on causal steering,
while aligning with human-interpretable concepts. Further analysis reveals that
specific neuron combinations are reused across semantically related features,
exposing a hierarchical structure in the MLP's activation space. Together,
these results position SNMF as a simple and effective tool for identifying
interpretable features and dissecting concept representations in LLMs.
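To make the factorization concrete, below is a minimal, hedged sketch of semi-nonnegative matrix factorization using the standard multiplicative updates (in the style of Ding, Li & Jordan), applied to a matrix of MLP activations. The function name `semi_nmf`, the matrix orientation (neurons x tokens), which factor is sign-constrained, and all hyperparameters are illustrative assumptions, not the paper's implementation; the authors' exact setup, sparsity handling, and training procedure may differ.

```python
import numpy as np


def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Semi-nonnegative matrix factorization: X ~= F @ G.T with G >= 0.

    Illustrative sketch (not the paper's code).
    X : (n_neurons, n_tokens) matrix of MLP activations (mixed sign allowed).
    k : number of features to extract.
    Returns F (n_neurons, k), unconstrained feature directions over neurons,
    and G (n_tokens, k), nonnegative per-token feature coefficients.
    """
    rng = np.random.default_rng(seed)
    n_neurons, n_tokens = X.shape
    G = np.abs(rng.standard_normal((n_tokens, k)))  # nonnegative init

    def pos(A):
        # Elementwise positive part of A (>= 0).
        return (np.abs(A) + A) / 2.0

    def neg(A):
        # Elementwise negative part of A (>= 0).
        return (np.abs(A) - A) / 2.0

    for _ in range(n_iter):
        # F-update: unconstrained least squares given G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        # G-update: multiplicative rule that keeps G nonnegative.
        XtF = X.T @ F
        FtF = F.T @ F
        num = pos(XtF) + G @ neg(FtF)
        den = neg(XtF) + G @ pos(FtF) + eps
        G *= np.sqrt(num / den)
    return F, G


# Illustrative usage on synthetic "activations" (shapes are arbitrary here):
acts = np.random.randn(3072, 2048)                 # d_mlp x n_tokens
F, G = semi_nmf(acts, k=16)
top_neurons = np.argsort(-np.abs(F[:, 0]))[:10]    # neurons dominating feature 0
print(top_neurons)
```

In this sketch the nonnegativity constraint sits on the per-token coefficients G, so each token's activation vector is reconstructed as a nonnegative mixture of the learned feature directions in F; inspecting the largest-magnitude entries of a column of F then gives the group of neurons associated with that feature.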