Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in the activations of a language model (LM). To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable, the decomposition must be sparse. These two objectives are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show, through manual and automated interpretability studies, that this improvement does not come at the cost of interpretability. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs, in which we replace the ReLU with a discontinuous JumpReLU activation function, and they are similarly efficient to train and run. By utilising straight-through estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
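The abstract does not spell out the activation function itself. As a rough illustration only, the sketch below assumes the usual definition of JumpReLU, in which a pre-activation passes through unchanged when it exceeds a per-feature threshold and is zeroed otherwise. The function names, the example values, and the thresholds are hypothetical and are not taken from the paper. Only the forward pass is shown; the straight-through estimators used for training are not reproduced here.

```python
def relu(z):
    """Vanilla ReLU: zero out negative pre-activations."""
    return [max(0.0, v) for v in z]


def jump_relu(z, theta):
    """Forward pass of an assumed JumpReLU definition.

    A pre-activation is kept unchanged if it is strictly above its
    threshold and set to zero otherwise, so the output jumps from 0
    to theta at the threshold (hence the discontinuity).
    """
    return [v if v > t else 0.0 for v, t in zip(z, theta)]


def l0(features):
    """L0 'norm': the number of non-zero (active) features."""
    return sum(1 for v in features if v != 0.0)


# Hypothetical pre-activations and thresholds, chosen for illustration.
pre_acts = [-1.0, 0.3, 0.8, 2.0]
theta = [0.5, 0.5, 0.5, 0.5]

print(relu(pre_acts))                  # [0.0, 0.3, 0.8, 2.0]
print(jump_relu(pre_acts, theta))      # [0.0, 0.0, 0.8, 2.0]
print(l0(relu(pre_acts)))              # 3
print(l0(jump_relu(pre_acts, theta)))  # 2
```

In this toy case the small activation 0.3 survives the ReLU but is removed by JumpReLU, while the larger activations keep their full magnitude. That is the intuition behind penalising L0 directly: sparsity is controlled by the threshold rather than by shrinking every active feature, as an L1 penalty tends to do.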
