Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
July 19, 2024
作者: Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
cs.AI
Abstract
Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. Through manual and automated interpretability studies, we also show that this improvement does not come at the cost of interpretability. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to train directly on L0 for sparsity, instead of training on proxies such as L1, avoiding problems like shrinkage.
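To make the two STE ideas in the abstract concrete, below is a minimal sketch in PyTorch of a JumpReLU SAE layer with pseudo-derivatives for the learnable threshold. This is not the authors' implementation: the rectangle kernel, the `eps` bandwidth, the exponential parameterisation of the threshold, and all class and variable names (`JumpReLUSTE`, `HeavisideSTE`, `JumpReLUSAE`) are illustrative assumptions based only on the description above.

```python
# Sketch of a JumpReLU SAE layer trained with straight-through estimators.
# All names and hyperparameters here are illustrative, not the paper's code.
import torch
import torch.nn as nn


def _rect(u: torch.Tensor) -> torch.Tensor:
    # Rectangle kernel K(u) = 1 if |u| <= 1/2 else 0 (one possible STE kernel).
    return (u.abs() <= 0.5).to(u.dtype)


class JumpReLUSTE(torch.autograd.Function):
    """JumpReLU(z) = z * H(z - theta), with a pseudo-derivative w.r.t. theta."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, g):
        z, theta = ctx.saved_tensors
        eps = ctx.eps
        g_z = g * (z > theta).to(z.dtype)  # pass gradient through active units
        # Pseudo-derivative: d/dtheta [z * H(z - theta)] ~ -(theta/eps) K((z-theta)/eps)
        g_theta = (g * (-theta / eps) * _rect((z - theta) / eps)).sum(dim=0)
        return g_z, g_theta, None


class HeavisideSTE(torch.autograd.Function):
    """H(z - theta), used so the L0 penalty can send gradients to theta."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, g):
        z, theta = ctx.saved_tensors
        eps = ctx.eps
        # Pseudo-derivative: d/dtheta H(z - theta) ~ -(1/eps) K((z-theta)/eps)
        g_theta = (g * (-1.0 / eps) * _rect((z - theta) / eps)).sum(dim=0)
        return None, g_theta, None  # no gradient to z through the hard step


class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, eps: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.log_theta = nn.Parameter(torch.zeros(d_sae))  # theta > 0 via exp
        self.eps = eps

    def forward(self, x):
        theta = self.log_theta.exp()
        z = x @ self.W_enc + self.b_enc             # pre-activations
        f = JumpReLUSTE.apply(z, theta, self.eps)   # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec         # reconstruction
        l0 = HeavisideSTE.apply(z, theta, self.eps).sum(-1).mean()
        return x_hat, f, l0


# Usage: reconstruction loss plus a direct, STE-differentiable L0 penalty,
# rather than an L1 proxy (which introduces shrinkage).
sae = JumpReLUSAE(d_model=16, d_sae=64)
x = torch.randn(8, 16)
x_hat, f, l0 = sae(x)
loss = (x - x_hat).pow(2).sum(-1).mean() + 3e-4 * l0
loss.backward()
```

The key design point this sketch illustrates is that the forward pass keeps the hard, discontinuous JumpReLU and hard L0 count, while the backward pass substitutes kernel-smoothed pseudo-derivatives only for the threshold parameter, so the sparsity level is optimised directly rather than through an L1 surrogate.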