Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?
February 15, 2026
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
cs.AI
Abstract
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks cast doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
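
The abstract does not specify architectures or metrics, but the core objects are simple. As a minimal sketch (assumptions: PyTorch, a TopK SAE, unit-norm decoder columns, and cosine-similarity matching against known ground-truth directions; the class and function names are illustrative, not the authors' code), a "random feature directions" baseline can be obtained by freezing the decoder at its random initialization, and the two headline quantities can be computed as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKSAE(nn.Module):
    """TopK sparse autoencoder; set freeze_decoder=True for the
    random-directions baseline (decoder stays at its random init,
    so only the activation pattern is trained)."""

    def __init__(self, d_model, d_dict, k, freeze_decoder=False):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)
        with torch.no_grad():
            # Unit-norm decoder columns: each column is one feature direction.
            self.decoder.weight /= self.decoder.weight.norm(dim=0, keepdim=True)
        if freeze_decoder:
            self.decoder.weight.requires_grad_(False)

    def forward(self, x):
        pre = self.encoder(x)
        # Keep only the k largest pre-activations per example (sparsity).
        vals, idx = pre.topk(self.k, dim=-1)
        codes = torch.zeros_like(pre).scatter_(-1, idx, F.relu(vals))
        return self.decoder(codes), codes


def explained_variance(x, x_hat):
    # Fraction of activation variance captured by the reconstruction.
    return 1.0 - ((x - x_hat).var() / x.var()).item()


def recovery_rate(true_dirs, learned_dirs, thresh=0.9):
    # Fraction of ground-truth directions matched by some learned feature
    # with cosine similarity above thresh (an illustrative criterion).
    t = F.normalize(true_dirs, dim=-1)
    w = F.normalize(learned_dirs, dim=-1)
    return ((t @ w.T).max(dim=-1).values > thresh).float().mean().item()
```

In this framing, the paper's synthetic-setup finding corresponds to a model scoring high on explained_variance (71%) while scoring low on recovery_rate (9%): good reconstruction does not imply that the learned dictionary aligns with the true generative features.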