Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Abstract

Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses have been carried out on small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test whether circuit analysis scales. In particular, we study multiple-choice question answering and investigate Chinchilla's ability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching scale naturally to Chinchilla, allowing us to identify and categorize a small set of 'output nodes' (attention heads and MLPs). We further study the 'correct letter' category of attention heads, aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answering, we show that the query, key, and value subspaces of these heads can be significantly compressed without loss of performance when operating on the answer labels, and that the query and key subspaces represent an 'Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution that includes randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of 'correct letter' heads in multiple-choice question answering.
