

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

March 11, 2026
Authors: Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
cs.AI

Abstract
Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery, and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. Learned graphs are sparse (5–6% edge density), domain-specific, and stable across seeds.
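The abstract defines CFS as a measure of whether graph-guided interventions produce larger downstream effects than random ones. A minimal sketch of such a metric, assuming (hypothetically) that CFS is the ratio of the mean downstream effect of interventions chosen along learned graph edges to that of interventions on randomly chosen features — the paper's exact formula is not given here:

```python
import numpy as np

def causal_fidelity_score(graph_effects, random_effects, eps=1e-8):
    """Hypothetical CFS sketch: ratio of the mean downstream effect of
    graph-guided interventions to that of random interventions.
    A score > 1 means intervening along learned edges perturbs the
    model's output more than chance-level feature ablation."""
    return float(np.mean(graph_effects) / (np.mean(random_effects) + eps))

# Toy example: per-intervention output shifts (e.g., KL divergence of logits)
graph_effects = np.array([0.9, 1.1, 1.0])     # ablations of graph-selected features
random_effects = np.array([0.2, 0.15, 0.25])  # ablations of random features
print(round(causal_fidelity_score(graph_effects, random_effects), 2))
```

Under this reading, the reported CFS ≈ 5.65 would mean graph-guided interventions shift the model's outputs roughly 5.7× more than random ones.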