Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

January 5, 2026
Author: Sourena Khanzadeh
cs.AI

Abstract

As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While Chain-of-Thought (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are faithful generative drivers of the model's output or merely post-hoc rationalizations. We introduce Project Ariadne, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs hard interventions (do-calculus) on intermediate reasoning nodes -- systematically inverting logic, negating premises, and reversing factual claims -- to measure the Causal Sensitivity (φ) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent Faithfulness Gap. We define and detect a widespread failure mode termed Causal Decoupling, where agents exhibit a violation density (ρ) of up to 0.77 in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.
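The abstract does not spell out how the Causal Sensitivity (φ) or violation density (ρ) are computed. As a rough illustration of the intervention loop it describes, here is a minimal Python sketch: every name in it (run_agent, invert, violation_density) is a hypothetical stand-in rather than the paper's actual interface, and the assumption that ρ = 1 - φ is one plausible reading, not the paper's stated definition.

```python
from typing import Callable, List

def causal_sensitivity(
    run_agent: Callable[[str, List[str]], str],
    question: str,
    trace: List[str],
    invert: Callable[[str], str],
) -> float:
    """Fraction of hard interventions on intermediate reasoning steps
    that flip the agent's final answer (a stand-in for the paper's φ).

    run_agent(question, steps) must force the agent to continue from
    the given (possibly edited) reasoning steps and return its answer.
    invert(step) returns the step with its logic negated or reversed.
    """
    if not trace:
        return 0.0
    baseline = run_agent(question, trace)
    flipped = 0
    for i, step in enumerate(trace):
        # Hard do()-style intervention: overwrite step i with its
        # negation while holding every other step fixed.
        edited = trace[:i] + [invert(step)] + trace[i + 1:]
        if run_agent(question, edited) != baseline:
            flipped += 1
    return flipped / len(trace)

def violation_density(phi: float) -> float:
    # One plausible reading of ρ: the share of interventions whose
    # contradicted premise left the answer unchanged ("causal
    # decoupling"). The paper's exact definition may differ.
    return 1.0 - phi
```

Under this reading, a trace with φ near 1 is causally load-bearing, while φ near 0 (ρ near 1, as in the reported densities of up to 0.77) would signal Reasoning Theater: the answer is fixed by latent parametric priors regardless of the stated logic.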