Explaining black box text modules in natural language with language models

Abstract

Large language models (LLMs) have demonstrated remarkable prediction performance on a growing array of tasks. However, their rapid proliferation and increasing opacity have created a growing need for interpretability. Here, we ask whether we can automatically obtain natural language explanations for black box text modules. A "text module" is any function that maps text to a scalar continuous value, such as a submodule within an LLM or a fitted model of a brain region. "Black box" indicates that we only have access to the module's inputs and outputs. We introduce Summarize and Score (SASC), a method that takes in a text module and returns a natural language explanation of the module's selectivity along with a score for how reliable the explanation is. We study SASC in three contexts. First, we evaluate SASC on synthetic modules and find that it often recovers ground truth explanations. Second, we use SASC to explain modules found within a pre-trained BERT model, enabling inspection of the model's internals. Finally, we show that SASC can generate explanations for the response of individual fMRI voxels to language stimuli, with potential applications to fine-grained brain mapping. All code for using SASC and reproducing results is available on GitHub.
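To make the definitions concrete, the following is a minimal sketch of what a black box text module looks like as code. It is not the SASC implementation, which the abstract does not specify. The names `TextModule`, `make_keyword_module`, and `mean_response_gap` are hypothetical, and the gap score is only an assumed stand-in for an explanation-reliability score.

```python
from typing import Callable, Sequence

# A "text module" in the abstract's sense: any function from text to a scalar.
TextModule = Callable[[str], float]


def make_keyword_module(keywords: Sequence[str]) -> TextModule:
    """Hypothetical synthetic module: fraction of tokens matching a keyword."""
    vocab = {k.lower() for k in keywords}

    def module(text: str) -> float:
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        return sum(tok in vocab for tok in tokens) / len(tokens)

    return module


def mean_response_gap(
    module: TextModule,
    related: Sequence[str],
    unrelated: Sequence[str],
) -> float:
    """Assumed illustrative score, not the paper's definition.

    Uses only input/output access to the module (the black box setting):
    mean response on text related to a candidate explanation minus the
    mean response on unrelated text.
    """
    mean_related = sum(module(t) for t in related) / len(related)
    mean_unrelated = sum(module(t) for t in unrelated) / len(unrelated)
    return mean_related - mean_unrelated


if __name__ == "__main__":
    animal_module = make_keyword_module(["dog", "cat"])
    gap = mean_response_gap(
        animal_module,
        related=["my dog barked", "the cat slept"],
        unrelated=["taxes are due", "rain fell today"],
    )
    print(f"response gap for candidate explanation 'animals': {gap:.3f}")
```

A larger gap suggests that the candidate explanation tracks what the module responds to. Any real score would need the procedure described in the paper itself.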
