Teach Old SAEs New Domain Tricks with Boosting
July 17, 2025
Authors: Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
cs.AI
Abstract
Sparse Autoencoders (SAEs) have emerged as powerful tools for interpreting the
internal representations of Large Language Models, yet they often fail to
capture domain-specific features not prevalent in their training corpora. This
paper introduces a residual learning approach that addresses this feature
blindness without requiring complete retraining. We propose training a
secondary SAE specifically to model the reconstruction error of a pretrained
SAE on domain-specific texts, effectively capturing features missed by the
primary model. By summing the outputs of both models during inference, we
demonstrate significant improvements in both LLM cross-entropy and explained
variance metrics across multiple specialized domains. Our experiments show that
this method efficiently incorporates new domain knowledge into existing SAEs
while maintaining their performance on general tasks. This approach enables
researchers to selectively enhance SAE interpretability for specific domains of
interest, opening new possibilities for targeted mechanistic interpretability
of LLMs.
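
The residual-learning recipe described above maps naturally onto a small amount of code. Below is a minimal PyTorch sketch of the idea, not the authors' implementation: the `SAE` class, the layer sizes, the `l1_coef` sparsity penalty, and the choice to feed the secondary SAE the primary's reconstruction residual (rather than the raw activations) are all illustrative assumptions inferred from the abstract.

```python
# Hypothetical sketch of residual SAE training (names and hyperparameters
# are illustrative, not from the paper). A frozen primary SAE reconstructs
# LLM activations; a secondary SAE is trained only on the primary's
# reconstruction error over domain-specific texts. At inference the two
# reconstructions are summed.
import torch
import torch.nn as nn


class SAE(nn.Module):
    """A standard ReLU sparse autoencoder over d_model activations."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(torch.relu(self.encoder(x)))


d_model, d_hidden = 768, 768 * 8    # illustrative sizes
primary = SAE(d_model, d_hidden)    # assumed pretrained on a general corpus
secondary = SAE(d_model, d_hidden)  # to be trained on domain texts

# Freeze the primary SAE so only the secondary learns; this is what lets
# the combined model keep the primary's behavior on general tasks.
primary.requires_grad_(False)
opt = torch.optim.Adam(secondary.parameters(), lr=1e-4)


def train_step(acts: torch.Tensor, l1_coef: float = 1e-3) -> float:
    """One step of residual training on a batch of domain activations."""
    with torch.no_grad():
        residual = acts - primary(acts)  # what the primary SAE misses
    codes = torch.relu(secondary.encoder(residual))
    recon = secondary.decoder(codes)
    # MSE on the residual plus an L1 sparsity penalty on the codes.
    loss = (recon - residual).pow(2).mean() + l1_coef * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


@torch.no_grad()
def reconstruct(acts: torch.Tensor) -> torch.Tensor:
    """Inference: sum the outputs of both models, per the abstract."""
    return primary(acts) + secondary(acts - primary(acts))
```

Because the primary SAE's weights never change, its reconstructions on general text are untouched; the secondary SAE only adds a correction term on top, which is consistent with the abstract's claim that general-task performance is maintained while domain-specific features are recovered.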