Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Abstract

A central goal of mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF-derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically related features, exposing a hierarchical structure in the MLP's activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
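
To make the decomposition concrete, the following is a minimal NumPy sketch of a generic semi-NMF, not the authors' implementation. It assumes an activation matrix A of shape (neurons x tokens) and factors it as A ≈ Z Y, where Z holds mixed-sign feature directions over neurons and Y holds nonnegative per-token coefficients. The function names (semi_nmf, top_activating_inputs), the alternating least-squares and multiplicative-update scheme, the random initialization, and the synthetic data are all assumptions made for illustration. The sparsity constraint on neuron combinations mentioned in the abstract is not implemented here.

```python
import numpy as np


def semi_nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Factor A (d x n, mixed sign) as A ≈ Z @ Y with Y >= 0.

    Z (d x k): feature directions over neurons, unconstrained in sign.
    Y (k x n): nonnegative coefficients, one column per input token.
    Generic semi-NMF sketch; hypothetical helper, not the paper's code.
    """
    rng = np.random.default_rng(seed)
    d, n = A.shape
    Y = rng.random((k, n)) + 0.1  # strictly positive initialization (assumption)

    def pos(M):  # elementwise positive part
        return (np.abs(M) + M) / 2

    def neg(M):  # elementwise negative part, returned as a nonnegative matrix
        return (np.abs(M) - M) / 2

    for _ in range(n_iter):
        # Z step: unconstrained least squares given the current Y.
        Z = A @ Y.T @ np.linalg.pinv(Y @ Y.T)
        # Y step: multiplicative update, which keeps every entry nonnegative.
        ZtA = Z.T @ A  # k x n
        ZtZ = Z.T @ Z  # k x k
        num = pos(ZtA) + neg(ZtZ) @ Y
        den = neg(ZtA) + pos(ZtZ) @ Y + eps
        Y *= np.sqrt(num / den)
    return Z, Y


def top_activating_inputs(Y, feature, m=5):
    """Indices of the m tokens with the largest coefficient for one feature."""
    return np.argsort(Y[feature])[::-1][:m]


# Synthetic stand-in for MLP activations: 32 neurons, 200 tokens, 4 features.
rng = np.random.default_rng(0)
A = rng.standard_normal((32, 4)) @ rng.random((4, 200))
Z, Y = semi_nmf(A, k=4)
print("relative error:", np.linalg.norm(A - Z @ Y) / np.linalg.norm(A))
print("top tokens for feature 0:", top_activating_inputs(Y, 0))
```

Because Y is nonnegative, each column of Z can be read as a direction in neuron space that is only ever added to the reconstruction, and each row of Y ranks the inputs by how strongly they activate that direction. This is the sense in which a feature is "mapped to its activating inputs".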
