Interpretability at Scale: Identifying Causal Mechanisms in Alpaca

Abstract

Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters, an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of our largest and most widely deployed language models.
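The idea of replacing a brute-force search step with learned parameters can be illustrated with a small sketch. The abstract does not give the details, so the following is an assumption-laden toy rather than the authors' implementation: it assumes the intervention swaps part of a hidden vector between a "base" and a "source" run inside a rotated basis, and that the size of the swapped subspace is controlled by a continuous boundary value instead of an integer tried exhaustively. All names here (`random_rotation`, `soft_boundary_mask`, `interchange`) are hypothetical.

```python
import numpy as np


def random_rotation(dim, seed=0):
    """Return a random orthogonal matrix (stand-in for a learned rotation)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q


def soft_boundary_mask(dim, boundary, temperature=0.1):
    """Mask that is close to 1 for indices below `boundary` and close to 0 above.

    `boundary` is a real number, so in a training setup it could be updated
    by gradient descent instead of being searched over by hand (assumption).
    """
    idx = np.arange(dim)
    x = (boundary - idx - 0.5) / temperature
    # Numerically stable sigmoid written via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def interchange(base, source, rotation, mask):
    """Swap the masked coordinates of `base` with those of `source`
    in the rotated basis, then rotate back to the original basis."""
    rotated_base = rotation @ base
    rotated_source = rotation @ source
    mixed = (1.0 - mask) * rotated_base + mask * rotated_source
    return rotation.T @ mixed


if __name__ == "__main__":
    dim = 4
    base = np.array([1.0, 2.0, 3.0, 4.0])
    source = np.array([-1.0, -2.0, -3.0, -4.0])
    rotation = random_rotation(dim)
    mask = soft_boundary_mask(dim, boundary=2.0)
    print(interchange(base, source, rotation, mask))
```

In this toy, a mask of all zeros returns the base vector unchanged and a mask of all ones returns the source vector, which is a quick sanity check that the rotation is orthogonal and the swap is well defined.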
