Explaining black box text modules in natural language with language models
May 17, 2023
Authors: Chandan Singh, Aliyah R. Hsu, Richard Antonello, Shailee Jain, Alexander G. Huth, Bin Yu, Jianfeng Gao
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable prediction
performance for a growing array of tasks. However, their rapid proliferation
and increasing opaqueness have created a growing need for interpretability.
Here, we ask whether we can automatically obtain natural language explanations
for black box text modules. A "text module" is any function that maps text to a
scalar continuous value, such as a submodule within an LLM or a fitted model of
a brain region. "Black box" indicates that we only have access to the module's
inputs/outputs.
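To make this interface concrete, here is a minimal sketch of what a black-box text module could look like in code. The `TextModule` alias, the food-word list, and `synthetic_food_module` are hypothetical stand-ins used only to illustrate the input/output contract; they are not taken from the paper's implementation.

```python
# Illustrative sketch: a "text module" is any callable mapping a string to a
# single float. This synthetic food-word module is a hypothetical example of
# such a module, not code from the paper.
from typing import Callable

TextModule = Callable[[str], float]

FOOD_WORDS = {"pizza", "bread", "soup", "noodles", "cheese"}

def synthetic_food_module(text: str) -> float:
    """Return a scalar response: the fraction of tokens that are food-related."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in FOOD_WORDS for t in tokens) / len(tokens)

# Black-box access: we may only call the module on inputs and read its outputs.
print(synthetic_food_module("I had soup and bread for lunch"))  # higher response
print(synthetic_food_module("The meeting was rescheduled"))     # near zero
```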
We introduce Summarize and Score (SASC), a method that takes in a text module
and returns a natural language explanation of the module's selectivity along
with a score for how reliable the explanation is. We study SASC in 3 contexts.
First, we evaluate SASC on synthetic modules and find that it often recovers
ground truth explanations. Second, we use SASC to explain modules found within
a pre-trained BERT model, enabling inspection of the model's internals.
Finally, we show that SASC can generate explanations for the response of
individual fMRI voxels to language stimuli, with potential applications to
fine-grained brain mapping. All code for using SASC and reproducing results is
made available on Github.
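As a rough illustration of the explain-and-score contract described above, the toy sketch below picks, from a set of candidate explanations, the one whose probe texts most strongly drive a module, and reports a difference-of-means reliability score. The candidate explanations, probe texts, and scoring rule here are assumptions made for illustration; they are not the SASC algorithm itself, which the abstract only characterizes by its interface (module in, explanation plus reliability score out).

```python
# Toy sketch of the explain-then-score interface: given black-box access to a
# module, return a candidate natural language explanation and a reliability
# score. The scoring rule (mean response on explanation-related probe texts
# minus mean response on unrelated baseline texts) is an illustrative choice,
# not the paper's method.
from statistics import mean
from typing import Callable, List, Tuple

TextModule = Callable[[str], float]

# Hypothetical black-box module (same illustrative food-word module as above).
FOOD_WORDS = {"pizza", "bread", "soup", "noodles", "cheese"}
def toy_module(text: str) -> float:
    tokens = text.lower().split()
    return sum(t in FOOD_WORDS for t in tokens) / max(len(tokens), 1)

def score_explanation(module: TextModule,
                      related: List[str],
                      baseline: List[str]) -> float:
    """Toy reliability score: how much more the module responds to
    explanation-related probe texts than to unrelated baseline texts."""
    return mean(module(t) for t in related) - mean(module(t) for t in baseline)

def explain(module: TextModule,
            candidates: List[Tuple[str, List[str]]],
            baseline: List[str]) -> Tuple[str, float]:
    """Return the candidate explanation with the highest toy reliability score."""
    return max(((expl, score_explanation(module, probes, baseline))
                for expl, probes in candidates),
               key=lambda pair: pair[1])

candidates = [
    ("responds to food-related words", ["I had soup and bread", "pizza with extra cheese"]),
    ("responds to weather words", ["it rained all day", "sunny with a light breeze"]),
]
baseline = ["the meeting was rescheduled", "she parked the car outside"]
explanation, score = explain(toy_module, candidates, baseline)
print(explanation, round(score, 3))
```

The paper's actual use cases apply this kind of contract to real modules, such as submodules of a pre-trained BERT model or fitted models of individual fMRI voxels; the sketch only mirrors the black-box interface.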