Mechanistic Permutability: Match Features Across Layers

October 10, 2024
Authors: Nikita Balagansky, Ian Maksimov, Daniil Gavrilov
cs.AI

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach involves matching features by minimizing the mean squared error between the folded parameters of SAEs, a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
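The matching procedure described above can be sketched in a few lines. The illustration below is a minimal NumPy/SciPy mock-up, not the authors' code: it assumes JumpReLU-style SAEs where each feature has a scalar activation threshold, folds the thresholds into the decoder rows to normalize feature scales, and then solves the feature permutation as a linear assignment problem over pairwise squared errors. Function names and the exact folding convention are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def fold_thresholds(w_dec: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Scale each feature's decoder row by its activation threshold.

    This absorbs the (per-feature) JumpReLU threshold into the weights,
    so that feature magnitudes are comparable across layers.
    w_dec: (n_features, d_model), theta: (n_features,)
    """
    return w_dec * theta[:, None]


def match_features(w_dec_a, theta_a, w_dec_b, theta_b):
    """Match SAE features of layer A to layer B by minimizing MSE
    between folded decoder parameters (a one-to-one permutation)."""
    fa = fold_thresholds(w_dec_a, theta_a)  # (n_a, d_model)
    fb = fold_thresholds(w_dec_b, theta_b)  # (n_b, d_model)
    # Pairwise squared-error cost between folded decoder rows.
    cost = ((fa[:, None, :] - fb[None, :, :]) ** 2).sum(axis=-1)
    # Hungarian algorithm finds the cost-minimizing assignment.
    rows, cols = linear_sum_assignment(cost)
    return cols  # cols[i] = layer-B feature matched to layer-A feature i


# Sanity check: a permuted copy of the same SAE should be recovered exactly.
rng = np.random.default_rng(0)
w = rng.normal(size=(5, 8))
theta = rng.uniform(0.5, 2.0, size=5)
perm = np.array([2, 0, 4, 1, 3])
matched = match_features(w, theta, w[perm], theta[perm])
print(np.array_equal(matched, np.argsort(perm)))  # True
```

In this toy setting the assignment exactly inverts the applied permutation; on real SAEs the folded parameters of adjacent layers differ, and the assignment cost itself serves as a measure of how well features align across layers.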
