Interpretability at Scale: Identifying Causal Mechanisms in Alpaca
May 15, 2023
Authors: Zhengxuan Wu, Atticus Geiger, Christopher Potts, Noah D. Goodman
cs.AI
Abstract
Obtaining human-interpretable explanations of large, general-purpose language models is an urgent goal for AI safety. However, it is just as important that our interpretability methods are faithful to the causal dynamics underlying model behavior and able to robustly generalize to unseen inputs. Distributed Alignment Search (DAS) is a powerful gradient descent method grounded in a theory of causal abstraction that uncovered perfect alignments between interpretable symbolic algorithms and small deep learning models fine-tuned for specific tasks. In the present paper, we scale DAS significantly by replacing the remaining brute-force search steps with learned parameters -- an approach we call Boundless DAS. This enables us to efficiently search for interpretable causal structure in large language models while they follow instructions. We apply Boundless DAS to the Alpaca model (7B parameters), which, off the shelf, solves a simple numerical reasoning problem. With Boundless DAS, we discover that Alpaca does this by implementing a causal model with two interpretable boolean variables. Furthermore, we find that the alignment of neural representations with these variables is robust to changes in inputs and instructions. These findings mark a first step toward deeply understanding the inner workings of our largest and most widely deployed language models.
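
The abstract's central technical move, replacing the brute-force search over alignments with learned parameters, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical rendering of a distributed interchange intervention: activations from a "base" and a "source" run are rotated into a learned orthogonal basis, a soft learned boundary selects which rotated coordinates come from the source run, and the result is rotated back. Names such as RotatedSubspaceSwap, boundary, and temperature are assumptions for illustration, not the paper's released code or API.

```python
# A minimal sketch (not the authors' code) of the idea described in the
# abstract: a distributed interchange intervention in a learned orthogonal
# basis, where a learned scalar boundary replaces brute-force search over
# which (and how many) coordinates to swap.

import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal


class RotatedSubspaceSwap(nn.Module):
    """Swap a learned subspace of a base activation with a source activation."""

    def __init__(self, hidden_dim: int, temperature: float = 0.1):
        super().__init__()
        # Learned orthogonal change of basis (the "distributed alignment").
        self.rotate = orthogonal(nn.Linear(hidden_dim, hidden_dim, bias=False))
        # Learned boundary: rotated coordinates below it are taken from the
        # source run instead of the base run.
        self.boundary = nn.Parameter(torch.tensor(hidden_dim / 2.0))
        self.register_buffer("positions", torch.arange(hidden_dim).float())
        self.temperature = temperature

    def forward(self, base_hidden: torch.Tensor, source_hidden: torch.Tensor):
        R = self.rotate.weight                    # (d, d), kept orthogonal
        base_rot = base_hidden @ R.T              # project into learned basis
        source_rot = source_hidden @ R.T
        # Soft 0/1 mask over coordinates; differentiable in `boundary`.
        mask = torch.sigmoid((self.boundary - self.positions) / self.temperature)
        swapped = mask * source_rot + (1.0 - mask) * base_rot
        return swapped @ R                        # rotate back (R.T @ R = I)


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 16
    swap = RotatedSubspaceSwap(d)
    base, source = torch.randn(2, d), torch.randn(2, d)
    patched = swap(base, source)
    print(patched.shape)  # torch.Size([2, 16])
    # In the full method, `patched` would replace a residual-stream activation
    # at a chosen layer/position of the instruction-following model, and the
    # rotation plus boundary would be trained by gradient descent to match the
    # counterfactual behavior of the hypothesized two-variable causal model.
```

Because both the rotation and the boundary are optimized by gradient descent, the search over alignments scales to large models such as Alpaca, where enumerating candidate subspaces by brute force would be intractable.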