Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
May 26, 2025
Authors: Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang
cs.AI
Abstract
Sparse Autoencoders (SAEs) are a prominent tool in mechanistic
interpretability (MI) for decomposing neural network activations into
interpretable features. However, the aspiration to identify a canonical set of
features is challenged by the observed inconsistency of learned SAE features
across different training runs, undermining the reliability and efficiency of
MI research. This position paper argues that mechanistic interpretability
should prioritize feature consistency in SAEs -- the reliable convergence to
equivalent feature sets across independent runs. We propose using the Pairwise
Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to
operationalize consistency and demonstrate that high levels are achievable
(0.80 for TopK SAEs on LLM activations) with appropriate architectural choices.
Our contributions include detailing the benefits of prioritizing consistency;
providing theoretical grounding and synthetic validation using a model
organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery;
and extending these findings to real-world LLM data, where high feature
consistency strongly correlates with the semantic similarity of learned feature
explanations. We call for a community-wide shift towards systematically
measuring feature consistency to foster robust cumulative progress in MI.
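The abstract names PW-MCC but does not spell out its computation. As a rough illustration only, the sketch below assumes the metric works by matching the dictionary vectors of two independently trained SAEs one-to-one, then averaging the absolute cosine similarity of the matched pairs. The function names, the use of absolute values, and the exhaustive search over permutations are all assumptions made for this example, not details taken from the paper; the exhaustive search is only feasible for toy dictionaries, and a real implementation would likely use an assignment solver such as `scipy.optimize.linear_sum_assignment`.

```python
import itertools
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pw_mcc(dict_a, dict_b):
    """Hypothetical PW-MCC sketch for two dictionaries of equal size.

    dict_a, dict_b: lists of feature vectors, one per dictionary atom.
    Returns the mean absolute cosine similarity under the best
    one-to-one matching of atoms (sign is ignored by assumption).
    """
    n = len(dict_a)
    # Similarity of every atom in run A to every atom in run B.
    sim = [[abs(cosine(a, b)) for b in dict_b] for a in dict_a]
    # Exhaustive search over matchings; O(n!), toy sizes only.
    best = max(
        sum(sim[i][perm[i]] for i in range(n))
        for perm in itertools.permutations(range(n))
    )
    return best / n


# Toy dictionaries standing in for decoder weights from separate runs.
run_1 = [[1.0, 0.0], [0.0, 1.0]]
run_2 = [[0.0, 2.0], [-3.0, 0.0]]   # same directions, permuted and rescaled
run_3 = [[1.0, 1.0], [1.0, -1.0]]   # directions rotated by 45 degrees

score_same = pw_mcc(run_1, run_2)   # equivalent feature sets
score_diff = pw_mcc(run_1, run_3)   # only partially aligned feature sets
print(score_same, score_diff)
```

Under these assumptions, two runs that learn the same directions up to permutation, scale, and sign score 1.0, while runs whose features are rotated relative to each other score lower, which is the sense in which the metric rewards convergence to equivalent feature sets.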