LLM Explainability Using Counterfactual Chains and Causal Graphs

Abstract

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. In this paper, we instead use causal graphs to model LLM inference itself, giving stakeholders a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with σ-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with the LLMs' reasoning. Taken together, these contributions provide a foundation for concept-level explainability of LLMs.
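
To make the idea of expanding sparse observations through chains of counterfactuals concrete, the following is a minimal sketch and not the procedure proposed in this paper. It assumes binary concept states and a chain that flips one concept per step; the abstract does not describe the actual proposal or acceptance rules. The names `toy_model` and `counterfactual_chain` are hypothetical, and `toy_model` is a stand-in for querying the target LLM.

```python
import random


def toy_model(state):
    """Hypothetical stand-in for the target LLM (assumption, not from the paper).

    Maps a tuple of binary concept states to a class label: 1 when at least
    two concepts are active, otherwise 0.
    """
    return int(sum(state) >= 2)


def counterfactual_chain(start, model, steps, rng):
    """Build a chain of counterfactuals by flipping one concept per step.

    Each step intervenes on a single randomly chosen concept of the current
    state and records the new state together with the model's label for it.
    """
    chain = []
    state = list(start)
    for _ in range(steps):
        i = rng.randrange(len(state))  # choose one concept to intervene on
        state[i] = 1 - state[i]        # flip it to obtain the counterfactual
        chain.append((tuple(state), model(state)))
    return chain


rng = random.Random(0)

# Two sparse "observed" concept-state vectors (illustrative values only).
observed = [(1, 0, 1), (0, 0, 1)]

# Start from the labeled observations, then append each counterfactual chain.
augmented = [(s, toy_model(s)) for s in observed]
for s in observed:
    augmented.extend(counterfactual_chain(s, toy_model, steps=5, rng=rng))
```

In this sketch the resulting `augmented` list of (concept state, label) pairs plays the role of the expanded dataset that a causal discovery algorithm would consume; the discovery step itself is omitted.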