Explainability of Large Language Models Based on Counterfactual Chains and Causal Graphs

Abstract

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. In this paper, we instead use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with σ-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with the LLMs' reasoning. Overall, this paper provides a foundation for concept-level explainability of LLMs.
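
To make the idea of expanding sparse observations through chains of counterfactuals concrete, the following is a minimal sketch and not the procedure used in this paper. It assumes binary concept states, a single-concept flip as the counterfactual proposal, and a Metropolis-style acceptance rule. The names `counterfactual_chain`, `toy_predict`, and `toy_plausibility` are hypothetical, and the two toy functions merely stand in for querying the target LLM and for whatever scoring the actual method applies.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # one binary value per concept (assumed encoding)


def counterfactual_chain(
    start: State,
    predict: Callable[[State], int],
    plausibility: Callable[[State], float],
    steps: int,
    rng: random.Random,
) -> List[Tuple[State, int]]:
    """Walk from an observed concept state through single-concept flips.

    Each proposal flips one concept and is accepted with probability
    min(1, plausibility(proposal) / plausibility(current)).
    Every visited state is recorded together with its predicted label.
    """
    current = start
    samples = [(current, predict(current))]
    for _ in range(steps):
        i = rng.randrange(len(current))
        proposal = current[:i] + (1 - current[i],) + current[i + 1:]
        ratio = plausibility(proposal) / plausibility(current)
        if rng.random() < min(1.0, ratio):
            current = proposal
        samples.append((current, predict(current)))
    return samples


# Hypothetical stand-ins: a real pipeline would query the target LLM here.
def toy_predict(state: State) -> int:
    # Positive class only when the first two concepts are both active.
    return int(state[0] and state[1])


def toy_plausibility(state: State) -> float:
    # Strictly positive score that favors states with fewer active concepts.
    return 0.5 ** sum(state)


chain = counterfactual_chain(
    start=(1, 0, 0),
    predict=toy_predict,
    plausibility=toy_plausibility,
    steps=20,
    rng=random.Random(0),
)
for state, label in chain[:5]:
    print(state, label)
```

The recorded (state, label) pairs would then serve as augmented data for a causal discovery step. Which proposal distribution and acceptance rule are appropriate depends on the method's details, which this sketch does not attempt to reproduce.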