
Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

February 15, 2026
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
cs.AI

Abstract

Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks cast doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
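
To make the "random baseline" idea concrete, the sketch below shows one way such a baseline could look: an SAE whose decoder directions are drawn at random and frozen, with only the encoder trained, plus the explained-variance metric the abstract reports. This is an illustrative reconstruction under our own assumptions (class and function names like RandomDirectionSAE and explained_variance are hypothetical), not the authors' implementation or their exact baseline definitions.

```python
# Minimal sketch (not the authors' code) of a random-directions SAE baseline:
# decoder directions are random unit vectors that are never trained, so any
# downstream score it matches cannot be attributed to learned feature directions.
import torch
import torch.nn as nn


class RandomDirectionSAE(nn.Module):
    """Hypothetical baseline: frozen random decoder, learnable encoder only."""

    def __init__(self, d_model: int, n_features: int, k: int):
        super().__init__()
        # Random, frozen unit-norm feature directions.
        W_dec = torch.randn(n_features, d_model)
        self.register_buffer("W_dec", W_dec / W_dec.norm(dim=-1, keepdim=True))
        # Only the encoder, which selects activation magnitudes, is trainable.
        self.encoder = nn.Linear(d_model, n_features)
        self.k = k  # number of active features per input (top-k sparsity)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre = self.encoder(x)
        # Keep only the k largest pre-activations (a common SAE sparsity scheme).
        topk = torch.topk(pre, self.k, dim=-1)
        acts = torch.zeros_like(pre).scatter(-1, topk.indices, torch.relu(topk.values))
        return acts @ self.W_dec  # reconstruction built from random directions


def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """Fraction of variance in x captured by the reconstruction x_hat."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0)).pow(2).sum()
    return float(1.0 - resid / total)
```

Under this kind of setup, the paper's headline comparison amounts to training the encoder (and, for a normal SAE, the decoder too) on model activations and then checking whether the fully-trained SAE beats the frozen-random variant on interpretability, sparse probing, and causal-editing scores; the abstract reports that it essentially does not.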