Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
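
To make the notion of feature stability concrete, the following is a minimal sketch, not the procedure used in this work. It assumes that "similar" means cosine similarity between decoder directions above a fixed threshold, and it estimates stability as the fraction of other-seed dictionaries that contain a match. The function name `feature_stability`, the threshold of 0.7, and the toy decoders are all hypothetical choices made for illustration.

```python
import numpy as np


def feature_stability(reference, others, threshold=0.7):
    """Per-feature stability of `reference` against other-seed dictionaries.

    reference: array of shape (n_features, d), one decoder direction per row.
    others:    list of arrays of shape (m_features, d), one per other seed.
    Returns the fraction of other seeds in which each reference feature has
    a match with cosine similarity >= threshold (an assumed criterion).
    """
    def unit_rows(m):
        m = np.asarray(m, dtype=float)
        return m / np.linalg.norm(m, axis=1, keepdims=True)

    ref = unit_rows(reference)
    matched = []
    for other in others:
        sims = ref @ unit_rows(other).T            # (n_features, m_features)
        matched.append(sims.max(axis=1) >= threshold)
    return np.mean(matched, axis=0)                # average over seeds


# Toy data (hypothetical): 8 directions shared by every seed up to small
# noise, plus 8 seed-specific random directions per dictionary.
d = 64
shared = np.random.default_rng(0).normal(size=(8, d))


def toy_decoder(seed):
    local = np.random.default_rng(seed)
    noisy_shared = shared + 0.05 * local.normal(size=shared.shape)
    private = local.normal(size=(8, d))
    return np.vstack([noisy_shared, private])


decoders = [toy_decoder(seed) for seed in range(1, 6)]
stability = feature_stability(decoders[0], decoders[1:])
print(np.round(stability, 2))
```

In this toy setting the first eight rows play the role of stable features and the last eight the role of unstable ones. A hard threshold on the best match is only one possible estimator; a different similarity measure (for example, activation correlation) or a soft matching score would fit the same definition.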