Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Abstract

Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, which undermines the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs, meaning the reliable convergence to equivalent feature sets across independent runs. We propose the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency, and we demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions are threefold: detailing the benefits of prioritizing consistency; providing theoretical grounding and synthetic validation using a model organism, which verifies PW-MCC as a reliable proxy for ground-truth recovery; and extending these findings to real-world LLM data, where high feature consistency strongly correlates with the semantic similarity of learned feature explanations. We call for a community-wide shift towards systematically measuring feature consistency to foster robust cumulative progress in MI.
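The abstract names PW-MCC but does not define it. As a rough illustration only, the sketch below assumes a common reading of such a metric: treat each SAE's decoder as a dictionary of feature directions, match features one-to-one between two independently trained dictionaries with the Hungarian algorithm, and average the absolute cosine similarity of the matched pairs. The function name `pairwise_mcc`, the array shapes, and the use of absolute cosine similarity are assumptions for this example, not the paper's stated procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_mcc(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean absolute cosine similarity between optimally matched features.

    dict_a, dict_b: arrays of shape (n_features, d_model), one row per
    feature direction (hypothetical layout; assumed for this sketch).
    """
    # Normalize each feature direction to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)

    # Absolute cosine similarity, so a sign-flipped feature still counts as a match.
    sim = np.abs(a @ b.T)

    # One-to-one matching that maximizes total similarity (Hungarian algorithm).
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run_1 = rng.normal(size=(8, 16))

    # A permuted, sign-flipped copy represents the same feature set,
    # so the score should be close to 1.0.
    run_2 = -run_1[rng.permutation(8)]
    print(pairwise_mcc(run_1, run_2))

    # An unrelated random dictionary should score noticeably lower.
    print(pairwise_mcc(run_1, rng.normal(size=(8, 16))))
```

Under these assumptions the score is invariant to feature ordering and sign, which is the sense in which two runs can converge to "equivalent" feature sets without producing identical weight matrices.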
