Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning
March 11, 2026
Authors: Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
cs.AI
Abstract
Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery, and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
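The abstract does not give the exact formula for the Causal Fidelity Score, but its description (graph-guided interventions should induce larger downstream effects than random ones, with a random baseline scoring near 1) suggests a ratio-style metric. The sketch below is an illustrative assumption, not the paper's definition: it computes the mean downstream effect of graph-guided interventions divided by the mean effect of random interventions, with all names and effect values hypothetical.

```python
import numpy as np

def causal_fidelity_score(guided_effects, random_effects):
    """Assumed ratio-style CFS: mean downstream effect magnitude under
    graph-guided interventions divided by the mean under random
    interventions. A score near 1 means the graph offers no advantage
    over random targeting; larger values mean the learned edges point
    at concepts whose perturbation matters more."""
    return float(np.mean(guided_effects) / np.mean(random_effects))

# Toy, synthetic effect magnitudes (hypothetical numbers, for shape only):
# each entry is the change in model output caused by one intervention.
guided = np.array([0.90, 1.10, 1.00, 1.20])   # interventions on graph-selected features
random_ = np.array([0.20, 0.25, 0.15, 0.20])  # interventions on randomly chosen features
print(causal_fidelity_score(guided, random_))  # prints 5.25
```

Under this reading, the reported random-baseline CFS of roughly 1.032 is consistent with random interventions being compared against themselves, while CCG's score above 5 indicates that edges in the learned graph concentrate on causally influential features.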