

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

July 19, 2024
Authors: Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
cs.AI

Abstract

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
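To make the abstract's two key ingredients concrete, the sketch below illustrates a JumpReLU SAE forward pass in PyTorch: the activation JumpReLU_θ(z) = z · H(z − θ) (with H the Heaviside step function and θ a learned per-feature threshold), and straight-through pseudo-gradients for θ using a rectangle kernel, in the spirit of the kernel-based pseudo-derivatives the paper describes. This is a minimal illustration, not the authors' reference implementation; the layer sizes, initialisation, bandwidth `eps`, log-threshold parameterisation, and loss wiring are all illustrative assumptions.

```python
import torch


class JumpReLU(torch.autograd.Function):
    """JumpReLU_theta(z) = z * H(z - theta).

    Forward is exact; backward uses a straight-through pseudo-derivative
    for theta with a rectangle kernel of bandwidth eps:
        d/dtheta JumpReLU(z) := -(theta / eps) * 1{|z - theta| < eps / 2}.
    The gradient w.r.t. z is the usual almost-everywhere one, H(z - theta).
    """

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        kernel = ((z - theta).abs() < ctx.eps / 2).to(z.dtype)
        grad_z = (z > theta).to(z.dtype) * grad_out
        # theta is broadcast over the batch, so sum its gradient back down.
        grad_theta = (-(theta / ctx.eps) * kernel * grad_out).reshape(
            -1, theta.shape[-1]).sum(0)
        return grad_z, grad_theta, None


class Step(torch.autograd.Function):
    """H(z - theta), used for the L0 penalty, with pseudo-derivative
        d/dtheta H(z - theta) := -(1 / eps) * 1{|z - theta| < eps / 2}."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        kernel = ((z - theta).abs() < ctx.eps / 2).to(z.dtype)
        grad_theta = (-(1.0 / ctx.eps) * kernel * grad_out).reshape(
            -1, theta.shape[-1]).sum(0)
        # The step is flat almost everywhere in z, so no gradient flows to z.
        return torch.zeros_like(z), grad_theta, None


class JumpReLUSAE(torch.nn.Module):
    def __init__(self, d_model: int, d_sae: int, eps: float = 1e-3):
        super().__init__()
        self.W_enc = torch.nn.Parameter(0.01 * torch.randn(d_model, d_sae))
        self.b_enc = torch.nn.Parameter(torch.zeros(d_sae))
        self.W_dec = torch.nn.Parameter(0.01 * torch.randn(d_sae, d_model))
        self.b_dec = torch.nn.Parameter(torch.zeros(d_model))
        # Parameterise theta = exp(log_theta) so thresholds stay positive.
        self.log_theta = torch.nn.Parameter(torch.zeros(d_sae))
        self.eps = eps

    def forward(self, x):
        theta = self.log_theta.exp()
        pre = x @ self.W_enc + self.b_enc               # pre-activations
        feats = JumpReLU.apply(pre, theta, self.eps)    # sparse features
        recon = feats @ self.W_dec + self.b_dec
        # L0 of the feature vector, made trainable in theta via the STE.
        l0 = Step.apply(pre, theta, self.eps).sum(-1).mean()
        return recon, l0


# Illustrative training objective: reconstruction error plus a direct
# L0 penalty, L = ||x - x_hat||^2 + lambda * L0 (no L1 proxy needed).
sae = JumpReLUSAE(d_model=16, d_sae=64)
x = torch.randn(8, 16)
recon, l0 = sae(x)
loss = ((recon - x) ** 2).sum(-1).mean() + 0.01 * l0
loss.backward()
```

Note how the rectangle kernel only fires for pre-activations within eps/2 of the threshold, so each gradient step to θ is driven by examples near the decision boundary; this is what allows the discontinuous L0 objective to be optimised directly rather than through a proxy like L1.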