FaithfulSAE: Towards Capturing Faithful Features with Sparse Autoencoders without External Dataset Dependencies

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising solution for decomposing large language model representations into interpretable features. However, Paulo and Belrose (2025) have highlighted instability across different initialization seeds, and Heap et al. (2025) have pointed out that SAEs may not capture model-internal features. These problems likely stem from training SAEs on external datasets, either collected from the Web or generated by another model, which may contain out-of-distribution (OOD) data beyond the model's generalisation capabilities. This can result in hallucinated SAE features, which we term "Fake Features", that misrepresent the model's internal activations. To address these issues, we propose FaithfulSAE, a method that trains SAEs on the model's own synthetic dataset. Using FaithfulSAEs, we demonstrate that training SAEs on less-OOD instruction datasets yields SAEs that are more stable across seeds. Notably, FaithfulSAEs outperform SAEs trained on web-based datasets in the SAE probing task and exhibit a lower Fake Feature Ratio in 5 out of 7 models. Overall, our approach eliminates the dependency on external datasets, advancing interpretability by better capturing model-internal features while highlighting the often neglected importance of SAE training datasets.
