Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units
January 29, 2026
Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan
cs.AI
Abstract
While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
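For reference, the kind of attribution described here builds on the classical influence-function estimate (Koh & Liang, 2017), which scores how upweighting a training sample z would perturb a measurement of interest. A plausible MDA-style instantiation replaces the usual test loss with a mechanistic metric m, such as an induction-head score; the abstract does not specify the exact metric or approximations used, so the following is an illustrative sketch rather than the authors' formulation:

    \mathcal{I}_{m}(z) \;=\; -\,\nabla_{\theta}\, m(\hat{\theta})^{\top} \, H_{\hat{\theta}}^{-1} \, \nabla_{\theta} L(z, \hat{\theta}),
    \qquad
    H_{\hat{\theta}} \;=\; \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L(z_i, \hat{\theta}),

where \hat{\theta} are the trained parameters, L is the training loss, and m is the mechanistic metric evaluated at \hat{\theta}. Training samples with the largest |\mathcal{I}_{m}(z)| are then the natural candidates for the removal and augmentation interventions described above; at LLM scale, H^{-1} is typically not formed explicitly but approximated (e.g., with Kronecker-factored curvature estimates), which is presumably what makes such a framework scalable, though the abstract does not state which approximation MDA adopts.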