Sparse Autoencoders Find Highly Interpretable Features in Language Models
September 15, 2023
Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
cs.AI
Abstract
One of the roadblocks to a better understanding of neural networks' internals
is polysemanticity, where neurons appear to activate in multiple,
semantically distinct contexts. Polysemanticity prevents us from identifying
concise, human-understandable explanations for what neural networks are doing
internally. One hypothesised cause of polysemanticity is
superposition, where neural networks represent more features than they
have neurons by assigning features to an overcomplete set of directions in
activation space, rather than to individual neurons. Here, we attempt to
identify those directions, using sparse autoencoders to reconstruct the
internal activations of a language model. These autoencoders learn sets of
sparsely activating features that are more interpretable and monosemantic than
directions identified by alternative approaches, where interpretability is
measured by automated methods. Ablating these features enables precise model
editing, for example, by removing capabilities such as pronoun prediction,
while disrupting model behaviour less than prior techniques. This work
indicates that it is possible to resolve superposition in language models using
a scalable, unsupervised method. Our method may serve as a foundation for
future mechanistic interpretability work, which we hope will enable greater
model transparency and steerability.
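To make the approach concrete, below is a minimal sketch of a sparse autoencoder trained to reconstruct language-model activations, assuming a PyTorch implementation. The class and variable names, layer sizes, and the L1 sparsity coefficient are illustrative assumptions, not details taken from the paper; the authors' exact architecture and training setup may differ. The final lines show the kind of feature ablation the abstract describes, here simply by zeroing one learned feature before decoding.

```python
# Minimal sketch of a sparse autoencoder for dictionary learning on LM
# activations (PyTorch). Names and hyperparameters are illustrative, not
# the paper's exact configuration.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        # Overcomplete dictionary: more learned feature directions than
        # the model's activation dimension.
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model, bias=False)

    def forward(self, x: torch.Tensor):
        # Sparse, non-negative feature activations via ReLU.
        f = torch.relu(self.encoder(x))
        x_hat = self.decoder(f)
        return x_hat, f


def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages each input
    # to be explained by only a few active features.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Hypothetical sizes; real runs would use activations captured from one
# layer of the language model rather than random placeholders.
d_model, n_features = 512, 4096
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

activations = torch.randn(1024, d_model)   # placeholder for LM activations
x_hat, f = sae(activations)
loss = loss_fn(activations, x_hat, f)
loss.backward()
opt.step()

# Ablating a learned feature: zero its activation before decoding, giving
# an edited reconstruction with that feature's contribution removed.
feature_idx = 7                             # hypothetical feature index
f_ablated = f.detach().clone()
f_ablated[:, feature_idx] = 0.0
x_edited = sae.decoder(f_ablated)
```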