FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets, either collected from the Web or generated by another model, which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features, which we term "Fake Features", that misrepresent the model's internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets makes them more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
