Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Abstract

Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, which undermines the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs, that is, the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable with appropriate architectural choices (a PW-MCC of 0.80 for TopK SAEs on LLM activations). Our contributions include detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
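
The abstract names PW-MCC but does not spell out how it is computed. As a rough illustration only, the sketch below assumes a common form of mean correlation coefficient: features from two independently trained dictionaries are matched one-to-one so as to maximize absolute cosine similarity, and the matched similarities are averaged. The function name `pairwise_mcc`, the row-per-feature layout, and the use of absolute cosine similarity are assumptions made for this example, not details confirmed by the abstract.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_mcc(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean absolute cosine similarity between one-to-one matched features.

    ASSUMPTION: each dictionary has shape (n_features, d_model), one feature
    direction per row (e.g. SAE decoder vectors from two training runs).
    """
    # Normalize every feature direction to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # Absolute cosine similarity, so a sign flip still counts as a match.
    sim = np.abs(a @ b.T)
    # Hungarian-style assignment that maximizes the total matched similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run_a = rng.normal(size=(8, 16))
    # Hypothetical second run: same features, permuted and sign-flipped.
    run_b = -run_a[rng.permutation(8)]
    print(pairwise_mcc(run_a, run_b))
```

Under these assumptions, two runs that learn the same features up to ordering and sign score close to 1, while dictionaries with mutually orthogonal features score 0.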

