Mechanistic Permutability: Match Features Across Layers

Abstract

Understanding how features evolve across layers in deep neural networks is a fundamental challenge in mechanistic interpretability, particularly due to polysemanticity and feature superposition. While Sparse Autoencoders (SAEs) have been used to extract interpretable features from individual layers, aligning these features across layers has remained an open problem. In this paper, we introduce SAE Match, a novel, data-free method for aligning SAE features across different layers of a neural network. Our approach matches features by minimizing the mean squared error between the folded parameters of SAEs, where folding is a technique that incorporates activation thresholds into the encoder and decoder weights to account for differences in feature scales. Through extensive experiments on the Gemma 2 language model, we demonstrate that our method effectively captures feature evolution across layers, improving feature matching quality. We also show that features persist over several layers and that our approach can approximate hidden states across layers. Our work advances the understanding of feature dynamics in neural networks and provides a new tool for mechanistic interpretability studies.
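The matching idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that folding multiplies each feature's decoder row by its activation threshold (the abstract does not state the exact rule), and it swaps in a brute-force search over permutations where a real setup would need a scalable assignment solver. The names `fold_decoder`, `mse`, and `match_features`, and all numbers, are hypothetical.

```python
from itertools import permutations


def fold_decoder(w_dec, theta):
    """Scale each feature's decoder row by its threshold (assumed folding rule)."""
    return [[t * w for w in row] for row, t in zip(w_dec, theta)]


def mse(a, b):
    """Mean squared error between two equally shaped matrices (lists of rows)."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


def match_features(w_a, w_b):
    """Return perm where feature i of layer A is matched to feature perm[i] of layer B.

    Brute force over all permutations: only viable for a handful of features.
    """
    best = min(
        permutations(range(len(w_b))),
        key=lambda p: mse(w_a, [w_b[j] for j in p]),
    )
    return list(best)


# Toy data: 3 features in a 2-dimensional hidden space (made-up values).
w_dec_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta_a = [1.0, 1.0, 1.0]

# Layer B holds the same directions, shuffled, with different scales and thresholds.
w_dec_b = [[0.0, 0.5], [0.5, 0.5], [2.0, 0.0]]
theta_b = [2.0, 2.0, 0.5]

folded_a = fold_decoder(w_dec_a, theta_a)
folded_b = fold_decoder(w_dec_b, theta_b)

perm = match_features(folded_a, folded_b)
print(perm)  # [2, 0, 1]
```

After folding, the rows of layer B line up exactly with those of layer A under the permutation found, so the MSE between matched rows is zero in this toy case. No data passes through the model at any point, which is what makes the matching data-free: only SAE parameters are compared.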
