FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies
June 21, 2025
Authors: Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Eduardo Rodrigues Vieira, Andrew Bermingham, Ziad El Sayed
cs.AI
Abstract
Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets, either collected from the Web or generated by another model, which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features that misrepresent the model's internal activations, which we term "Fake Features". To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets makes them more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
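To make the core object concrete, the sketch below trains a minimal sparse autoencoder (linear encoder with ReLU, linear decoder, L1 sparsity penalty) on simulated activations. This is not the paper's implementation: the activation data, dimensions, and hyperparameters are illustrative stand-ins; in FaithfulSAE the activations would come from the LLM's own hidden states on text the model generated itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model-internal activations: a sparse mixture of
# ground-truth feature directions. In the paper these would be LLM activations
# collected on the model's own synthetic dataset.
d_model, d_sae, n = 32, 64, 2048
true_feats = rng.normal(size=(d_sae, d_model))
codes = rng.random((n, d_sae)) * (rng.random((n, d_sae)) < 0.05)  # ~5% active
acts = codes @ true_feats

# Minimal SAE parameters (illustrative scales, not the paper's).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_enc = np.zeros(d_sae)
lr, l1 = 1e-2, 1e-3

def sae_mse(X):
    z = np.maximum(X @ W_enc + b_enc, 0.0)
    return float(np.mean((z @ W_dec - X) ** 2))

mse0 = sae_mse(acts)

# Plain gradient descent on  mean ||x_hat - x||^2 + l1 * mean |z|.
for step in range(200):
    batch = acts[rng.integers(0, n, 256)]
    z = np.maximum(batch @ W_enc + b_enc, 0.0)   # sparse codes (ReLU)
    err = z @ W_dec - batch                      # reconstruction error
    g_recon = 2 * err / len(batch)
    g_z = g_recon @ W_dec.T + l1 * np.sign(z) / len(batch)
    g_z *= (z > 0)                               # ReLU passes grad where active
    W_dec -= lr * z.T @ g_recon
    W_enc -= lr * batch.T @ g_z
    b_enc -= lr * g_z.sum(axis=0)

mse = sae_mse(acts)
sparsity = float(np.mean(np.maximum(acts @ W_enc + b_enc, 0.0) > 0))
```

The paper's contribution is orthogonal to this loop: it concerns *which data* `acts` is collected on, arguing that using the model's own generations (rather than an external web corpus) yields features that are more seed-stable and less "fake".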