Mechanistic Permutability: Match Features Across Layers

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly because of polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach matches features by minimizing the mean squared error between the folded parameters of SAEs, where folding is a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers and improves feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
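As a rough illustration of the idea described in the abstract, the sketch below folds per-feature activation thresholds into toy encoder and decoder weights, then searches for the permutation of one layer's features that minimizes the squared error against another layer's folded decoder. The abstract does not spell out the folding rule or the solver, so the specifics here are assumptions: one positive threshold `theta` per feature, encoder weights and biases divided by `theta`, decoder weights multiplied by `theta`, matching on decoder weights only, and an exhaustive search that is feasible only at toy sizes. All function and variable names are hypothetical.

```python
from itertools import permutations


def fold(w_enc, b_enc, w_dec, theta):
    """Fold thresholds into the weights (assumed rule, one vector per feature).

    Encoder weights and biases are divided by theta and decoder weights are
    multiplied by theta, so every feature ends up with an effective
    threshold of 1 while the product activation * decoder is unchanged.
    """
    folded_enc = [[w / t for w in row] for row, t in zip(w_enc, theta)]
    folded_bias = [b / t for b, t in zip(b_enc, theta)]
    folded_dec = [[w * t for w in row] for row, t in zip(w_dec, theta)]
    return folded_enc, folded_bias, folded_dec


def squared_error(a, b):
    """Sum of squared differences between two equally shaped matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def match_features(dec_a, dec_b):
    """Return perm so that feature perm[i] of B is matched to feature i of A.

    Exhaustive search over all permutations: toy sizes only.
    """
    n = len(dec_a)
    return min(
        permutations(range(n)),
        key=lambda p: squared_error(dec_a, [dec_b[j] for j in p]),
    )


# Toy layer A: 3 features in a 2-dimensional hidden space, thresholds of 1.
w_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta_a = [1.0, 1.0, 1.0]

# Toy layer B: the same directions, reordered and stored at different scales.
w_b = [[0.0, 0.5], [0.25, 0.25], [2.0, 0.0]]
theta_b = [2.0, 4.0, 0.5]

zero_bias = [0.0, 0.0, 0.0]
_, _, folded_a = fold(w_a, zero_bias, w_a, theta_a)
_, _, folded_b = fold(w_b, zero_bias, w_b, theta_b)

perm = match_features(folded_a, folded_b)
print(perm)  # (2, 0, 1)
```

After folding, the three decoder vectors of layer B coincide with those of layer A up to ordering, so the recovered permutation has zero error. At realistic dictionary sizes the exhaustive search would have to be replaced by a linear assignment solver.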
