Explainability of Large Language Models via Counterfactual Chains and Causal Graphs

Abstract

Causal graphs provide a high-level language for making mechanisms transparent. Recent work uses Large Language Models (LLMs) to recover causal graphs of external-world processes. Instead, in this paper, we use causal graphs to model LLM inference itself, providing stakeholders with a transparent view of how the model perceives and organizes high-level concepts to produce a prediction. We propose a four-phase method for constructing such graphs. Given a target LLM and a set of textual examples, our method discovers class-discriminative, human-interpretable concepts and maps each input to LLM-perceived concept states. We then introduce an MCMC-inspired counterfactual augmentation procedure that expands the sparse observational data through chains of counterfactuals. This enables stable causal discovery with σ-CG, yielding informative, interpretable graphs. We apply our method to three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. We evaluate the learned graphs for predictive fidelity and structural stability, and the MCMC-inspired augmentation for convergence and downstream utility. Our results show that the discovered causal graphs capture meaningful dependencies consistent with LLMs' reasoning. Together, this paper provides a foundation for concept-level explainability of LLMs.
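To make the idea of a counterfactual chain concrete, the following is a minimal sketch, not the procedure used in the paper. It assumes binary concept states and a proposal that flips one concept per step, and it accepts every proposal; the abstract specifies neither the proposal nor any acceptance rule. The names `counterfactual_chain` and `toy_predict` are hypothetical, and `toy_predict` is a stand-in for querying the target LLM.

```python
import random
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # assumed: one binary value per discovered concept


def counterfactual_chain(
    start: State,
    predict: Callable[[State], int],
    steps: int,
    seed: int = 0,
) -> List[Tuple[State, int]]:
    """Random-walk sketch: flip one concept per step and record the label.

    Every proposal is accepted here; a real sampler would likely add an
    acceptance criterion, which the abstract does not describe.
    """
    rng = random.Random(seed)
    current = start
    chain = [(current, predict(current))]
    for _ in range(steps):
        i = rng.randrange(len(current))  # pick one concept to intervene on
        current = current[:i] + (1 - current[i],) + current[i + 1:]
        chain.append((current, predict(current)))
    return chain


def toy_predict(state: State) -> int:
    """Hypothetical stand-in for the LLM: positive iff concepts 0 and 1 are on."""
    return int(state[0] == 1 and state[1] == 1)


# Expand a single observed concept state into a chain of counterfactuals.
chain = counterfactual_chain((1, 0, 1), toy_predict, steps=5)
for state, label in chain:
    print(state, label)
```

Each chain entry pairs a concept state with the label it receives, so a handful of observed inputs can be expanded into a larger table of (concept state, prediction) rows of the kind a causal discovery algorithm consumes.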