Sparse Autoencoders Find Highly Interpretable Features in Language Models
September 15, 2023
Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey
cs.AI
Abstract
One of the roadblocks to a better understanding of neural networks' internals
is polysemanticity, where neurons appear to activate in multiple,
semantically distinct contexts. Polysemanticity prevents us from identifying
concise, human-understandable explanations for what neural networks are doing
internally. One hypothesised cause of polysemanticity is
superposition, where neural networks represent more features than they
have neurons by assigning features to an overcomplete set of directions in
activation space, rather than to individual neurons. Here, we attempt to
identify those directions, using sparse autoencoders to reconstruct the
internal activations of a language model. These autoencoders learn sets of
sparsely activating features that are more interpretable and monosemantic than
directions identified by alternative approaches, where interpretability is
measured by automated methods. Ablating these features enables precise model
editing, for example, by removing capabilities such as pronoun prediction,
while disrupting model behaviour less than prior techniques. This work
indicates that it is possible to resolve superposition in language models using
a scalable, unsupervised method. Our method may serve as a foundation for
future mechanistic interpretability work, which we hope will enable greater
model transparency and steerability.
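
To make the method described above concrete, here is a minimal sketch of a sparse autoencoder trained to reconstruct cached language-model activations, assuming PyTorch. The dimensions, the L1 coefficient, and the training loop are illustrative assumptions, not the paper's exact architecture or hyperparameters.

```python
# Minimal sparse-autoencoder sketch (illustrative, not the paper's implementation).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_activation: int, d_features: int):
        super().__init__()
        # Overcomplete dictionary: more learned features than activation dimensions.
        self.encoder = nn.Linear(d_activation, d_features)
        self.decoder = nn.Linear(d_features, d_activation, bias=False)

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; the L1 penalty below keeps them sparse.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features

# Hypothetical sizes: activations of width 512 mapped to a 4096-feature dictionary.
sae = SparseAutoencoder(d_activation=512, d_features=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # illustrative weight on the sparsity penalty

def training_step(activations: torch.Tensor) -> torch.Tensor:
    """One optimisation step on a batch of cached LM activations."""
    reconstruction, features = sae(activations)
    recon_loss = torch.mean((reconstruction - activations) ** 2)
    sparsity_loss = l1_coeff * features.abs().mean()
    loss = recon_loss + sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

In this framing, ablating a feature (as the abstract describes for model editing) would amount to zeroing one column of the feature activations before decoding, so only the contribution of that learned direction is removed from the reconstructed activation.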