Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
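To make the feature stability measure concrete, the following is a minimal sketch rather than the authors' procedure. It assumes that two features count as "similar" when the cosine similarity of their decoder directions exceeds a cutoff, and it estimates stability as the fraction of other-seed SAEs that contain such a match. The function names, the `threshold` value of 0.7, and the toy dictionaries are all hypothetical; the abstract does not specify the matching criterion.

```python
import numpy as np


def unit_rows(w):
    """Normalize each row (one decoder direction per feature) to unit length."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def feature_stability(reference, others, threshold=0.7):
    """Fraction of other-seed dictionaries that contain a match for each feature.

    reference: (n_features, d_model) decoder matrix of one SAE.
    others: list of (n_features_k, d_model) decoder matrices from other seeds.
    threshold: assumed cosine-similarity cutoff for calling two features similar.
    """
    ref = unit_rows(reference)
    matched = np.zeros(ref.shape[0])
    for other in others:
        sims = ref @ unit_rows(other).T           # pairwise cosine similarities
        matched += sims.max(axis=1) >= threshold  # best match per reference feature
    return matched / len(others)


# Toy data (not real SAE weights): 3 shared directions reappear with small
# noise in every seed, while the other 5 directions are drawn fresh per seed.
rng = np.random.default_rng(0)
d_model = 64
shared = rng.normal(size=(3, d_model))


def toy_dictionary():
    noisy_shared = shared + 0.05 * rng.normal(size=shared.shape)
    fresh = rng.normal(size=(5, d_model))
    return np.vstack([noisy_shared, fresh])


reference = toy_dictionary()
others = [toy_dictionary() for _ in range(4)]
stability = feature_stability(reference, others)
print(stability)
```

In this toy setup the three shared directions are matched in every other dictionary and the freshly drawn ones are not, so the score separates the two groups. Matching on decoder directions is only one possible choice; a criterion based on activation correlations would fit the same interface.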