Sparse Autoencoders Find Highly Interpretable Features in Language Models
September 15, 2023
Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
cs.AI
Abstract
One of the roadblocks to a better understanding of neural networks' internals
is polysemanticity, where neurons appear to activate in multiple,
semantically distinct contexts. Polysemanticity prevents us from identifying
concise, human-understandable explanations for what neural networks are doing
internally. One hypothesised cause of polysemanticity is
superposition, where neural networks represent more features than they
have neurons by assigning features to an overcomplete set of directions in
activation space, rather than to individual neurons. Here, we attempt to
identify those directions, using sparse autoencoders to reconstruct the
internal activations of a language model. These autoencoders learn sets of
sparsely activating features that are more interpretable and monosemantic than
directions identified by alternative approaches, where interpretability is
measured by automated methods. Ablating these features enables precise model
editing, for example, by removing capabilities such as pronoun prediction,
while disrupting model behaviour less than prior techniques. This work
indicates that it is possible to resolve superposition in language models using
a scalable, unsupervised method. Our method may serve as a foundation for
future mechanistic interpretability work, which we hope will enable greater
model transparency and steerability.
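
To make the method described above concrete, here is a minimal sketch of a sparse autoencoder trained to reconstruct cached language-model activations, assuming PyTorch. The dimensions, the L1 coefficient, and the training loop are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Minimal sparse-autoencoder sketch (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: more learned features than activation dimensions.
        self.encoder = nn.Linear(d_activation, d_features)
        self.decoder = nn.Linear(d_features, d_activation, bias=False)

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; the L1 penalty below keeps them sparse.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Hypothetical sizes: activations of width 512 mapped to a 4096-feature dictionary.
sae = SparseAutoencoder(d_activation=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # illustrative weight on the sparsity penalty

def training_step(activations: torch.Tensor) -> torch.Tensor:
    """One optimisation step on a batch of cached LM activations."""
    reconstruction, features = sae(activations)
    recon_loss = torch.mean((reconstruction - activations) ** 2)
    sparsity_loss = l1_coeff * features.abs().mean()
    loss = recon_loss + sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

In this framing, ablating a feature (as the abstract describes for model editing) would amount to zeroing one column of the feature activations before decoding, so only the contribution of that learned direction is removed from the reconstructed activation.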