

Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

March 11, 2026
Authors: Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz
cs.AI

Abstract
Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery, and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. Learned graphs are sparse (5–6% edge density), domain-specific, and stable across seeds.
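The abstract defines CFS as a measure of whether graph-guided interventions produce larger downstream effects than random ones. A minimal sketch of such a metric, assuming (hypothetically) that CFS is the ratio of the mean downstream effect of interventions chosen along learned graph edges to that of interventions on randomly chosen features — the paper's exact formula is not given here:

```python
import numpy as np

def causal_fidelity_score(graph_effects, random_effects, eps=1e-8):
    """Hypothetical CFS sketch: ratio of the mean downstream effect of
    graph-guided interventions to that of random interventions.
    A score > 1 means intervening along learned edges perturbs the
    model's output more than chance-level feature ablation."""
    return float(np.mean(graph_effects) / (np.mean(random_effects) + eps))

# Toy example: per-intervention output shifts (e.g., KL divergence of logits)
graph_effects = np.array([0.9, 1.1, 1.0])     # ablations of graph-selected features
random_effects = np.array([0.2, 0.15, 0.25])  # ablations of random features
print(round(causal_fidelity_score(graph_effects, random_effects), 2))
```

Under this reading, the reported CFS ≈ 5.65 would mean graph-guided interventions shift the model's outputs roughly 5.7× more than random ones.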