Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Abstract

Sparse autoencoders can localize where concepts live in language models, but they do not capture how those concepts interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery, and we introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n = 15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. The learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
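For readers unfamiliar with DAGMA-style structure learning, the sketch below illustrates the log-determinant acyclicity penalty from the DAGMA method, h(W) = -log det(sI - W∘W) + d log s, which is zero when the weighted adjacency matrix W encodes a DAG and positive when it contains a cycle. This is a minimal illustration only: whether CCG uses exactly this penalty is an assumption, and the function name `dagma_acyclicity` and the example matrices are hypothetical and not taken from the paper.

```python
import numpy as np

def dagma_acyclicity(W, s=1.0):
    """Log-det acyclicity penalty: h(W) = -log det(sI - W*W) + d*log(s).

    W is a (d, d) weighted adjacency matrix; W*W is the elementwise square.
    Simplification (assumption): we only check that the determinant is
    positive, rather than the full M-matrix condition DAGMA relies on.
    """
    d = W.shape[0]
    M = s * np.eye(d) - W * W
    sign, logdet = np.linalg.slogdet(M)
    if sign <= 0:
        raise ValueError("W lies outside the region where the penalty is defined")
    return -logdet + d * np.log(s)

# Hypothetical 3-concept graph: strictly upper-triangular, hence acyclic.
dag = np.array([[0.0, 0.8, 0.0],
                [0.0, 0.0, 0.5],
                [0.0, 0.0, 0.0]])

# Adding an edge from concept 2 back to concept 0 closes a cycle.
cyclic = dag.copy()
cyclic[2, 0] = 0.6

h_dag = dagma_acyclicity(dag)        # zero up to floating-point error
h_cyclic = dagma_acyclicity(cyclic)  # strictly positive
```

In a differentiable structure learner, a penalty of this kind is typically added to a data-fit loss so that gradient descent is pushed toward acyclic edge weights.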
