Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
July 18, 2023
Authors: Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
cs.AI
Abstract
Circuit analysis is a promising technique for understanding the
internal mechanisms of language models. However, existing analyses are done in
small models far from the state of the art. To address this, we present a case
study of circuit analysis in the 70B Chinchilla model, aiming to test the
scalability of circuit analysis. In particular, we study multiple-choice
question answering, and investigate Chinchilla's capability to identify the
correct answer label given knowledge of the correct answer text.
We find that the existing techniques of logit attribution, attention pattern
visualization, and activation patching naturally scale to Chinchilla, allowing
us to identify and categorize a small set of "output nodes" (attention heads
and MLPs).
We further study the "correct letter" category of attention heads, aiming to
understand the semantics of their features, with mixed results. For normal
multiple-choice question answers, we significantly compress the query, key and
value subspaces of the head without loss of performance when operating on the
answer labels for multiple-choice questions, and we show that the query and key
subspaces represent an "Nth item in an enumeration" feature to at least some
extent. However, when we attempt to use this explanation to understand the
heads' behaviour on a more general distribution including randomized answer
labels, we find that it is only a partial explanation, suggesting there is more
to learn about the operation of "correct letter" heads in multiple-choice
question answering.
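
As a rough illustration of the logit attribution technique mentioned above, the following sketch splits a token's logit into per-component contributions. It is not the authors' code: the function name, the toy dimensions, the component names, and the random weights are all hypothetical, and the sketch assumes the final layer norm can be ignored so that the logit is linear in the residual stream.

```python
import numpy as np

def direct_logit_attribution(component_outputs, unembed, token_id):
    """Attribute one token's logit to individual components.

    component_outputs: dict mapping a (hypothetical) component name to the
        vector of shape (d_model,) it writes into the residual stream at the
        final position.
    unembed: unembedding matrix of shape (d_model, vocab_size).
    token_id: index of the token whose logit is being attributed.

    Assumption: no final layer norm, so the logit is a linear function of
    the summed component outputs.
    """
    direction = unembed[:, token_id]
    return {name: float(vec @ direction) for name, vec in component_outputs.items()}

# Toy example with random data; the shapes are illustrative only.
rng = np.random.default_rng(0)
d_model, vocab_size = 16, 50
unembed = rng.normal(size=(d_model, vocab_size))
outputs = {
    "head_a": rng.normal(size=d_model),
    "mlp_b": rng.normal(size=d_model),
}

attributions = direct_logit_attribution(outputs, unembed, token_id=3)

# Under the linearity assumption, the contributions sum to the full logit.
total_logit = float((sum(outputs.values()) @ unembed)[3])
```

Components with the largest contributions to the correct answer label would be candidates for the "output nodes" the abstract describes. In a real model the final normalization has to be handled as well, for example by holding its scale fixed.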