Sparse Autoencoders Find Highly Interpretable Features in Language Models
September 15, 2023
Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
cs.AI
Abstract
One of the roadblocks to a better understanding of neural networks' internals
is polysemanticity, where neurons appear to activate in multiple,
semantically distinct contexts. Polysemanticity prevents us from identifying
concise, human-understandable explanations for what neural networks are doing
internally. One hypothesised cause of polysemanticity is
superposition, where neural networks represent more features than they
have neurons by assigning features to an overcomplete set of directions in
activation space, rather than to individual neurons. Here, we attempt to
identify those directions, using sparse autoencoders to reconstruct the
internal activations of a language model. These autoencoders learn sets of
sparsely activating features that are more interpretable and monosemantic than
directions identified by alternative approaches, where interpretability is
measured by automated methods. Ablating these features enables precise model
editing, for example, by removing capabilities such as pronoun prediction,
while disrupting model behaviour less than prior techniques. This work
indicates that it is possible to resolve superposition in language models using
a scalable, unsupervised method. Our method may serve as a foundation for
future mechanistic interpretability work, which we hope will enable greater
model transparency and steerability.
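To make the approach concrete, below is a minimal sketch of a sparse autoencoder trained to reconstruct language-model activations, assuming a PyTorch implementation. The class and variable names, layer sizes, and the L1 sparsity coefficient are illustrative assumptions, not details taken from the paper; the authors' exact architecture and training setup may differ. The final lines show the kind of feature ablation the abstract describes, here simply by zeroing one learned feature before decoding.

```python
# Minimal sketch of a sparse autoencoder for dictionary learning on LM
# activations (PyTorch). Names and hyperparameters are illustrative, not
# the paper's exact configuration.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: more learned feature directions than
        # the model's activation dimension.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # Sparse, non-negative feature activations via ReLU.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages each input
    # to be explained by only a few active features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Hypothetical sizes; real runs would use activations captured from one
# layer of the language model rather than random placeholders.
d_model, n_features = 512, 4096
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(1024, d_model)   # placeholder for LM activations
x_hat, f = sae(activations)
loss = loss_fn(activations, x_hat, f)
loss.backward()
opt.step()

# Ablating a learned feature: zero its activation before decoding, giving
# an edited reconstruction with that feature's contribution removed.
feature_idx = 7                             # hypothetical feature index
f_ablated = f.detach().clone()
f_ablated[:, feature_idx] = 0.0
x_edited = sae.decoder(f_ablated)
```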