LLM Explainability Using Counterfactual Chains and Causal Graphs

Abstract

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. In this paper, we instead use causal graphs to model LLM inference itself, giving stakeholders a transparent view of how the model perceives and organizes high-level concepts when producing a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with σ-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with the LLMs' reasoning. Taken together, this work provides a foundation for concept-level explainability of LLMs.
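
To make the idea of expanding sparse observations through chains of counterfactuals concrete, the following is a minimal sketch under strong simplifying assumptions, not the procedure used in this paper. It assumes concept states are binary tuples, that each counterfactual flips exactly one concept of the previous state (a symmetric random-walk proposal that is always accepted), and that the target LLM can be replaced by a stub. The names `counterfactual_chain`, `augment`, and `toy_predict` are hypothetical and introduced only for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple

State = Tuple[int, ...]          # one binary flag per concept (assumption)
Row = Tuple[State, int]          # (concept state, predicted class)


def counterfactual_chain(
    start: State,
    predict: Callable[[State], int],
    steps: int,
    rng: random.Random,
) -> List[Row]:
    """Walk from an observed state by flipping one concept per step."""
    rows: List[Row] = [(start, predict(start))]
    current = start
    for _ in range(steps):
        i = rng.randrange(len(current))  # pick a concept to intervene on
        proposal = current[:i] + (1 - current[i],) + current[i + 1:]
        rows.append((proposal, predict(proposal)))  # query the model
        current = proposal  # always accept: a plain random walk
    return rows


def augment(
    observed: Sequence[State],
    predict: Callable[[State], int],
    steps: int,
    seed: int = 0,
) -> List[Row]:
    """Run one chain per observed state and pool all visited rows."""
    rng = random.Random(seed)
    rows: List[Row] = []
    for state in observed:
        rows.extend(counterfactual_chain(state, predict, steps, rng))
    return rows


def toy_predict(state: State) -> int:
    """Hypothetical stand-in for the target LLM: class 1 if >= 2 concepts."""
    return int(sum(state) >= 2)


observed = [(1, 0, 0), (0, 1, 1)]
augmented = augment(observed, toy_predict, steps=5)
```

In this sketch, the pooled rows in `augmented` play the role of the expanded dataset that a causal discovery algorithm would consume. A real pipeline would replace `toy_predict` with calls to the target LLM on counterfactual text.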