Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Abstract

Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, which undermines the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs: the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency, and we demonstrate that high consistency is achievable with appropriate architectural choices (a PW-MCC of 0.80 for TopK SAEs on LLM activations). Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
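To make the metric concrete, here is a minimal sketch of how a pairwise dictionary MCC between two SAE runs might be computed. The abstract does not spell out the definition, so the details are assumptions for illustration: features are taken to be rows of a decoder matrix, similarity is signed cosine similarity, and features are paired one-to-one with the Hungarian algorithm. The function name `pairwise_mcc` and the toy dictionaries are hypothetical, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_mcc(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean cosine similarity under the best one-to-one feature matching.

    Assumption: each dictionary has shape (n_features, d_model),
    with one feature direction per row.
    """
    # Normalize every feature direction to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # Cosine similarity between every feature in A and every feature in B.
    sim = a @ b.T
    # Hungarian algorithm: the one-to-one pairing with maximal total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    # Average similarity over the matched pairs.
    return float(sim[rows, cols].mean())


# Toy check: the same features in a different order count as fully consistent.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
shuffled = base[rng.permutation(8)]
print(pairwise_mcc(base, shuffled))
```

Because the matching is over permutations, the score does not depend on the order in which each run happens to learn its features. How sign flips and unequal dictionary sizes are treated is left open here and would need to follow the paper's own definition.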
