Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the signal relevant to reconstruction and prediction, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible low-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
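
The two quantities the abstract relies on, per-feature stability and subspace-level reproducibility, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration, not the paper's procedure: similarity is taken to be cosine similarity between decoder directions, a feature counts as "reappearing" when its best match exceeds a hypothetical threshold, stability is the fraction of other seeds with such a match, and subspace agreement is the mean squared cosine of the principal angles. The toy decoders are synthetic: four directions are shared by every seed, while the remaining six are drawn at random inside a fixed 3-dimensional subspace.

```python
import numpy as np


def unit_rows(w):
    """Normalize each row (one decoder direction per row) to unit length."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)


def feature_stability(decoders, ref=0, threshold=0.9):
    """For each feature of decoders[ref], return the fraction of the other
    seeds that contain a feature with cosine similarity >= threshold.
    The cosine criterion and the threshold value are illustrative assumptions."""
    base = unit_rows(decoders[ref])
    others = [unit_rows(d) for i, d in enumerate(decoders) if i != ref]
    matched = [(base @ o.T).max(axis=1) >= threshold for o in others]
    return np.mean(matched, axis=0)


def subspace_overlap(a, b, rank):
    """Mean squared cosine of the principal angles between the top-`rank`
    row spaces of a and b (1.0 means the subspaces coincide)."""
    qa = np.linalg.svd(a, full_matrices=False)[2][:rank].T
    qb = np.linalg.svd(b, full_matrices=False)[2][:rank].T
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return float(np.mean(cosines ** 2))


# Synthetic stand-in for three independently seeded SAEs (not real SAE weights).
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((16, 7)))
shared = basis[:, :4].T      # 4 directions identical in every seed
low_rank = basis[:, 4:].T    # basis of a fixed 3-dimensional subspace

decoders = []
for _ in range(3):
    # Each seed picks its own 6 directions inside the same 3-dim subspace.
    seed_specific = rng.standard_normal((6, 3)) @ low_rank
    decoders.append(np.vstack([shared, seed_specific]))

stability = feature_stability(decoders)
overlap = subspace_overlap(decoders[0][4:], decoders[1][4:], rank=3)
print("per-feature stability:", stability)
print("subspace overlap of seed-specific features:", overlap)
```

In this toy setup the shared directions are matched in every other seed by construction, while the seed-specific directions span the same subspace in every seed whether or not any individual direction finds a match. That is the distinction between feature-level and subspace-level reproducibility that the abstract draws.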