Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
July 18, 2023
Authors: Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah, Vladimir Mikulik
cs.AI
Abstract
Circuit analysis is a promising technique for understanding the
internal mechanisms of language models. However, existing analyses are done in
small models far from the state of the art. To address this, we present a case
study of circuit analysis in the 70B Chinchilla model, aiming to test the
scalability of circuit analysis. In particular, we study multiple-choice
question answering, and investigate Chinchilla's capability to identify the
correct answer label given knowledge of the correct answer text.
We find that the existing techniques of logit attribution, attention pattern
visualization, and activation patching naturally scale to Chinchilla, allowing
us to identify and categorize a small set of 'output nodes' (attention heads
and MLPs).
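
As a rough illustration of what logit attribution measures, the sketch below decomposes one token's logit into additive per-component contributions. It is a toy under stated assumptions, not the paper's implementation: the component names (`head_A`, `head_B`, `mlp_C`), the 3-dimensional residual stream, and all numbers are hypothetical, and any final normalization before unembedding is ignored so that the decomposition is exactly linear.

```python
# Toy direct logit attribution. All names and numbers are illustrative
# assumptions; no real model weights are involved.

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def direct_logit_attribution(component_outputs, unembed_column):
    """Map each component name to its additive contribution to one token's logit.

    Assumes the final residual stream is the sum of the component outputs and
    that the logit is a linear readout of it (no final normalization).
    """
    return {name: dot(out, unembed_column)
            for name, out in component_outputs.items()}

# Hypothetical vectors that each component writes into the residual stream.
component_outputs = {
    "head_A": [1.0, 0.0, 2.0],
    "head_B": [0.5, -2.0, 0.0],
    "mlp_C": [0.0, 2.0, 1.0],
}
# Hypothetical unembedding direction for the answer-label token of interest.
unembed_column = [1.0, 0.5, 2.0]

contributions = direct_logit_attribution(component_outputs, unembed_column)

# Sanity check: the contributions add up to the logit of the summed residual.
residual = [sum(vals) for vals in zip(*component_outputs.values())]
total_logit = dot(residual, unembed_column)

for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {value:+.2f}")
print(f"total logit: {total_logit:+.2f}")
```

Ranking components by the size of their contribution is one simple way to shortlist candidate output nodes; in this toy, `head_A` and `mlp_C` dominate while `head_B` slightly opposes the logit.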
We further study the 'correct letter' category of attention heads, aiming to
understand the semantics of their features, with mixed results. For normal
multiple-choice question answers, we significantly compress the query, key and
value subspaces of the head without loss of performance when operating on the
answer labels for multiple-choice questions, and we show that the query and key
subspaces represent an 'Nth item in an enumeration' feature to at least some
extent. However, when we attempt to use this explanation to understand the
heads' behaviour on a more general distribution including randomized answer
labels, we find that it is only a partial explanation, suggesting there is more
to learn about the operation of 'correct letter' heads on multiple-choice
question answering.