ChatPaper.ai

Mechanistic Permutability: Match Features Across Layers

October 10, 2024
Authors: Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
cs.AI

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
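The matching procedure described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes a JumpReLU-style SAE with per-feature thresholds `theta`, folds those thresholds into hypothetical encoder/decoder weight matrices to normalize feature scales, and then matches features between two layers by solving a linear assignment problem over pairwise squared distances between folded decoder rows (the abstract specifies MSE minimization; the Hungarian solver here is one standard way to realize that as a permutation).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_thresholds(w_enc, w_dec, theta):
    """Fold per-feature activation thresholds into SAE weights.

    Shapes are assumptions for illustration:
      w_enc: (d_model, n_features), w_dec: (n_features, d_model),
      theta: (n_features,) positive thresholds.
    Scaling encoder columns by 1/theta and decoder rows by theta
    puts features from different layers on a comparable scale.
    """
    w_enc_folded = w_enc / theta[None, :]
    w_dec_folded = w_dec * theta[:, None]
    return w_enc_folded, w_dec_folded


def match_features(w_dec_a, w_dec_b):
    """Match features of SAE A to SAE B by minimizing squared error.

    cost[i, j] = ||w_dec_a[i] - w_dec_b[j]||^2; the Hungarian
    algorithm returns the permutation with minimal total cost.
    """
    diff = w_dec_a[:, None, :] - w_dec_b[None, :, :]
    cost = np.einsum("ijk,ijk->ij", diff, diff)
    _, cols = linear_sum_assignment(cost)
    return cols  # cols[i] = index in B matched to feature i of A
```

As a sanity check, matching a decoder matrix against a row-permuted copy of itself recovers the inverse permutation, since each feature's nearest counterpart is its own shuffled copy.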


PDF (192) · November 16, 2024