Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents
January 5, 2026
Author: Sourena Khanzadeh
cs.AI
Abstract
As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While Chain-of-Thought (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are faithful generative drivers of the model's output or merely post-hoc rationalizations. We introduce Project Ariadne, a novel explainable-AI (XAI) framework that uses Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs hard interventions (do-calculus) on intermediate reasoning nodes -- systematically inverting logic, negating premises, and reversing factual claims -- to measure the Causal Sensitivity (φ) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent Faithfulness Gap. We define and detect a widespread failure mode, termed Causal Decoupling, in which agents exhibit a violation density (ρ) of up to 0.77 in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, demonstrating that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for measuring the alignment between stated logic and model action.
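The abstract does not include an implementation, but the intervention protocol it describes can be sketched concretely. The following is a minimal, illustrative Python sketch, not the paper's actual code: `query_model`, `negate_premise`, and the use of a φ = 0 threshold to count a violation toward ρ are all assumptions made here for clarity, and the paper's interventions also include logic inversion and factual-claim reversal beyond the naive textual negation shown.

```python
# Illustrative sketch of Ariadne-style counterfactual auditing (assumptions noted above).
# query_model(prompt) -> str is a hypothetical callable returning the agent's
# final answer when given a prompt containing an (edited) reasoning trace.

from typing import Callable, List

def negate_premise(step: str) -> str:
    """Hard intervention do(step := not step): a naive textual negation (assumption)."""
    return f"It is NOT the case that: {step}"

def causal_sensitivity(
    query_model: Callable[[str], str],
    question: str,
    reasoning_steps: List[str],
) -> float:
    """Estimate φ: the fraction of intervened reasoning nodes whose negation
    changes the terminal answer. φ near 0 signals Causal Decoupling
    ("Reasoning Theater"); φ near 1 signals a causally load-bearing trace."""
    def answer_with(steps: List[str]) -> str:
        trace = "\n".join(steps)
        return query_model(f"{question}\nReasoning:\n{trace}\nFinal answer:")

    baseline = answer_with(reasoning_steps)
    flips = 0
    for i in range(len(reasoning_steps)):
        intervened = list(reasoning_steps)
        intervened[i] = negate_premise(intervened[i])  # do-intervention on node i
        if answer_with(intervened) != baseline:
            flips += 1
    return flips / len(reasoning_steps) if reasoning_steps else 0.0

def violation_density(phis: List[float]) -> float:
    """Estimate ρ over an audit set: the share of items whose answer survives
    every contradiction of its own stated logic (φ == 0, an assumed criterion)."""
    violations = sum(1 for phi in phis if phi == 0.0)
    return violations / len(phis) if phis else 0.0
```

Under this reading, the reported ρ of up to 0.77 would mean that for roughly three quarters of audited items in factual and scientific domains, negating any intermediate reasoning step leaves the terminal answer unchanged.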