Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
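
To make the two central quantities concrete, the following is a minimal sketch and not the procedure used in the study. It assumes that "similar" means a high cosine similarity between decoder directions and that a fixed threshold of 0.7 decides a match; the abstract specifies neither, and the function names `feature_stability` and `subspace_overlap` are hypothetical. The toy dictionaries are hand-built so that four directions recur exactly across seeds, while three others span the same subspace under a different basis in each seed.

```python
import numpy as np

def unit_rows(W):
    """Normalize each row (one feature direction per row) to unit length."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def feature_stability(reference, others, threshold=0.7):
    """Per-feature fraction of other-seed dictionaries that contain a match.

    A match is ASSUMED to mean: the best cosine similarity to any feature
    in the other dictionary is at least `threshold` (illustrative value).
    """
    ref = unit_rows(reference)
    hits = np.zeros(ref.shape[0])
    for other in others:
        sims = ref @ unit_rows(other).T          # (n_ref, n_other) cosines
        hits += sims.max(axis=1) >= threshold
    return hits / len(others)

def subspace_overlap(A, B):
    """Mean squared cosine of the principal angles between two row spans.

    1.0 means the spans coincide; 0.0 means they are orthogonal.
    """
    Qa, _ = np.linalg.qr(A.T)                    # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B.T)
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.mean(s ** 2))

# Toy setup (hypothetical data): 16-dim activations, 7 features per seed.
d = 16
eye = np.eye(d)
stable = eye[:4]                                 # identical in every seed
block = eye[4:7]                                 # basis of a shared 3-dim subspace

# Two fixed orthogonal changes of basis inside the shared subspace.
R1 = np.eye(3) - (2.0 / 3.0) * np.ones((3, 3))   # a Householder reflection
R2 = -R1

seed0 = np.vstack([stable, block])
seed1 = np.vstack([stable, R1 @ block])
seed2 = np.vstack([stable, R2 @ block])

stability = feature_stability(seed0, [seed1, seed2])
overlap = subspace_overlap(seed0[4:], seed1[4:])

print("per-feature stability:", stability)
print("unstable-block subspace overlap:", round(overlap, 6))
```

In this toy case the last three features never find a match in another seed, yet the subspace they span is recovered exactly. That is the pattern the abstract calls basis ambiguity within a shared region of activation space, as opposed to pure noise.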