Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Abstract

Sparse autoencoders can localize where concepts live in language models, but they do not reveal how those concepts interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, in which edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery, and we introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds (n=15 paired runs), CCG achieves CFS = 5.654 ± 0.625, outperforming ROME-style tracing (3.382 ± 0.233), SAE-only ranking (2.479 ± 0.196), and a random baseline (1.032 ± 0.034), with p < 0.0001 after Bonferroni correction. The learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.
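As a rough illustration of what "DAGMA-style differentiable structure learning" refers to, the sketch below computes the log-determinant acyclicity function used by DAGMA, h(W) = -log det(sI - W∘W) + d log s, which is zero when the weighted adjacency matrix W describes a DAG and positive when it contains a cycle. The abstract does not state the exact formulation used for CCG, so this is only the standard DAGMA penalty under that assumption; the function name `dagma_acyclicity` and the toy matrices are hypothetical.

```python
import numpy as np


def dagma_acyclicity(W: np.ndarray, s: float = 1.0) -> float:
    """Log-det acyclicity penalty h(W) = -log det(sI - W*W) + d*log(s).

    W is a d x d weighted adjacency matrix over concept nodes
    (W[i, j] != 0 means an edge i -> j). Illustrative helper only.
    """
    d = W.shape[0]
    M = s * np.eye(d) - W * W  # W * W is the elementwise (Hadamard) square
    sign, logabsdet = np.linalg.slogdet(M)
    # A positive determinant is necessary (not sufficient) for W to lie in
    # the region where the penalty is well defined.
    if sign <= 0:
        raise ValueError("W is outside the domain of the log-det penalty")
    return float(-logabsdet + d * np.log(s))


# Toy example: a 3-node chain 0 -> 1 -> 2 is acyclic.
W_dag = np.array([[0.0, 0.8, 0.0],
                  [0.0, 0.0, 0.5],
                  [0.0, 0.0, 0.0]])

# Adding an edge 2 -> 0 closes a cycle.
W_cyc = W_dag.copy()
W_cyc[2, 0] = 0.7

h_dag = dagma_acyclicity(W_dag)  # zero up to floating-point error
h_cyc = dagma_acyclicity(W_cyc)  # strictly positive
```

In a structure-learning loop, a penalty of this kind would typically be added to a reconstruction loss over the sparse latent features, so that gradient descent is pushed toward acyclic edge weights.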

