Mechanistic Permutability: Match Features Across Layers

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
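The matching step described in the abstract can be illustrated with a toy sketch. It rests on several assumptions that do not come from the text above: folding is modeled as scaling each feature's decoder vector by its activation threshold, only decoder weights are compared, and the function names `fold_decoder` and `match_features` are hypothetical. An exhaustive search over permutations stands in for a proper assignment solver, so it is only usable for a handful of features. At realistic dictionary sizes a linear assignment routine such as `scipy.optimize.linear_sum_assignment` would be the more plausible choice.

```python
from itertools import permutations


def fold_decoder(decoder_rows, thresholds):
    """Assumed folding rule: scale each feature's decoder vector by its threshold."""
    return [[theta * w for w in row] for row, theta in zip(decoder_rows, thresholds)]


def squared_error(u, v):
    """Sum of squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def match_features(folded_a, folded_b):
    """Find the permutation that minimizes total squared error between two layers.

    Returns (perm, cost), where feature i of layer A is paired with
    feature perm[i] of layer B. Brute force, so only for tiny inputs.
    """
    n = len(folded_a)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(squared_error(folded_a[i], folded_b[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm), best_cost


# Toy data (made up for illustration): three features in a 2-dimensional space.
# Layer B holds the same folded directions as layer A, stored in a different
# order and with different raw scales and thresholds.
decoder_a = [[1, 0], [0, 1], [1, 1]]
thresholds_a = [2, 1, 0.5]
decoder_b = [[0, 2], [1, 1], [4, 0]]
thresholds_b = [0.5, 0.5, 0.5]

folded_a = fold_decoder(decoder_a, thresholds_a)
folded_b = fold_decoder(decoder_b, thresholds_b)

perm, cost = match_features(folded_a, folded_b)
print(perm, cost)  # [2, 0, 1] 0.0
```

In this toy case the folded vectors of the two layers coincide once reordered, so the recovered permutation has zero cost. No data passes through the model at any point, which is the sense in which a matching of this kind is data-free.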
