ChatPaper.ai


FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies

June 21, 2025
Authors: Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Eduardo Rodrigues Vieira, Andrew Bermingham, Ziad El Sayed
cs.AI

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets - either collected from the Web or generated by another model - which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features, which we term "Fake Features", that misrepresent the model's internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets results in SAEs being more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
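The central object in this work, a sparse autoencoder trained to decompose model activations into sparse interpretable features, can be illustrated with a minimal sketch. This is not the authors' implementation; the dimensions, the ReLU-plus-TopK sparsity rule, and the tied decoder initialization are illustrative assumptions:

```python
import numpy as np

def topk_mask(x, k):
    """Keep the k largest entries per row, zero the rest (a common SAE sparsity rule)."""
    drop_idx = np.argsort(x, axis=-1)[:, :-k]  # indices of everything except the top k
    out = x.copy()
    np.put_along_axis(out, drop_idx, 0.0, axis=-1)
    return out

class SparseAutoencoder:
    """Minimal SAE: map activations into an overcomplete sparse code, then reconstruct."""
    def __init__(self, d_model=16, d_hidden=64, k=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, (d_model, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = self.W_enc.T.copy()  # tied initialization (illustrative choice)
        self.k = k

    def encode(self, acts):
        pre = acts @ self.W_enc + self.b_enc
        return topk_mask(np.maximum(pre, 0.0), self.k)  # ReLU then TopK

    def decode(self, codes):
        return codes @ self.W_dec

    def reconstruction_loss(self, acts):
        recon = self.decode(self.encode(acts))
        return float(np.mean((acts - recon) ** 2))

# Toy stand-in for model-internal activations (e.g. residual-stream vectors).
# In FaithfulSAE the training activations would come from the model's own
# synthetic dataset rather than an external web corpus.
acts = np.random.default_rng(1).normal(size=(8, 16))
sae = SparseAutoencoder()
codes = sae.encode(acts)
assert all(int((codes[i] != 0).sum()) <= sae.k for i in range(len(codes)))
```

The paper's seed-stability comparison amounts to training such SAEs from different random seeds on the same activations and measuring how well their learned feature dictionaries match; the sketch above only fixes the forward pass, not the training loop.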