Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
July 19, 2024
作者: Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, Neel Nanda
cs.AI
Abstract
Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse -- two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. Through manual and automated interpretability studies, we also show that this improvement does not come at the cost of interpretability. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs -- where we replace the ReLU with a discontinuous JumpReLU activation function -- and are similarly efficient to train and run. By utilising straight-through estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to train directly on L0 for sparsity, instead of training on proxies such as L1, avoiding problems like shrinkage.
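To make the two STE ideas in the abstract concrete, below is a minimal sketch in PyTorch of a JumpReLU SAE layer with pseudo-derivatives for the learnable threshold. This is not the authors' implementation: the rectangle kernel, the `eps` bandwidth, the exponential parameterisation of the threshold, and all class and variable names (`JumpReLUSTE`, `HeavisideSTE`, `JumpReLUSAE`) are illustrative assumptions based only on the description above.

```python
# Sketch of a JumpReLU SAE layer trained with straight-through estimators.
# All names and hyperparameters here are illustrative, not the paper's code.
import torch
import torch.nn as nn


def _rect(u: torch.Tensor) -> torch.Tensor:
    # Rectangle kernel K(u) = 1 if |u| <= 1/2 else 0 (one possible STE kernel).
    return (u.abs() <= 0.5).to(u.dtype)


class JumpReLUSTE(torch.autograd.Function):
    """JumpReLU(z) = z * H(z - theta), with a pseudo-derivative w.r.t. theta."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return z * (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, g):
        z, theta = ctx.saved_tensors
        eps = ctx.eps
        g_z = g * (z > theta).to(z.dtype)  # pass gradient through active units
        # Pseudo-derivative: d/dtheta [z * H(z - theta)] ~ -(theta/eps) K((z-theta)/eps)
        g_theta = (g * (-theta / eps) * _rect((z - theta) / eps)).sum(dim=0)
        return g_z, g_theta, None


class HeavisideSTE(torch.autograd.Function):
    """H(z - theta), used so the L0 penalty can send gradients to theta."""

    @staticmethod
    def forward(ctx, z, theta, eps):
        ctx.save_for_backward(z, theta)
        ctx.eps = eps
        return (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, g):
        z, theta = ctx.saved_tensors
        eps = ctx.eps
        # Pseudo-derivative: d/dtheta H(z - theta) ~ -(1/eps) K((z-theta)/eps)
        g_theta = (g * (-1.0 / eps) * _rect((z - theta) / eps)).sum(dim=0)
        return None, g_theta, None  # no gradient to z through the hard step


class JumpReLUSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, eps: float = 1e-3):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.log_theta = nn.Parameter(torch.zeros(d_sae))  # theta > 0 via exp
        self.eps = eps

    def forward(self, x):
        theta = self.log_theta.exp()
        z = x @ self.W_enc + self.b_enc             # pre-activations
        f = JumpReLUSTE.apply(z, theta, self.eps)   # sparse feature activations
        x_hat = f @ self.W_dec + self.b_dec         # reconstruction
        l0 = HeavisideSTE.apply(z, theta, self.eps).sum(-1).mean()
        return x_hat, f, l0


# Usage: reconstruction loss plus a direct, STE-differentiable L0 penalty,
# rather than an L1 proxy (which introduces shrinkage).
sae = JumpReLUSAE(d_model=16, d_sae=64)
x = torch.randn(8, 16)
x_hat, f, l0 = sae(x)
loss = (x - x_hat).pow(2).sum(-1).mean() + 3e-4 * l0
loss.backward()
```

The key design point this sketch illustrates is that the forward pass keeps the hard, discontinuous JumpReLU and hard L0 count, while the backward pass substitutes kernel-smoothed pseudo-derivatives only for the threshold parameter, so the sparsity level is optimised directly rather than through an L1 surrogate.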