Teach Old SAEs New Domain Tricks with Boosting

Abstract

Sparse Autoencoders (SAEs) have emerged as powerful tools for interpreting the internal representations of Large Language Models (LLMs), yet they often fail to capture domain-specific features that are not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE to model the reconstruction error of a pretrained SAE on domain-specific texts, so that it captures the features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. The approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
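
The residual objective and the summed inference step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the one-hidden-layer ReLU architecture, the L1 sparsity penalty, the choice to feed the secondary SAE the same activation as the primary one, and all function names are hypothetical. The optimizer loop is omitted, and the toy weights are set by hand rather than trained.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """One-hidden-layer ReLU SAE (assumed architecture): returns (features, reconstruction)."""
    features = np.maximum(0.0, x @ W_enc + b_enc)
    return features, features @ W_dec + b_dec

def residual_target(x, base):
    """Reconstruction error of the frozen primary SAE; the secondary SAE's training target."""
    _, base_recon = sae_forward(x, *base)
    return x - base_recon

def residual_loss(x, base, residual, l1_coeff=1e-3):
    """MSE between the secondary SAE's output and the primary SAE's error, plus an L1 penalty."""
    # Assumption: the secondary SAE reads the same activation x as the primary SAE.
    features, recon = sae_forward(x, *residual)
    mse = np.mean(np.sum((recon - residual_target(x, base)) ** 2, axis=1))
    return mse + l1_coeff * np.mean(np.sum(np.abs(features), axis=1))

def boosted_reconstruction(x, base, residual):
    """Inference: sum the outputs of the primary and secondary SAEs."""
    _, base_recon = sae_forward(x, *base)
    _, res_recon = sae_forward(x, *residual)
    return base_recon + res_recon

def explained_variance(x, recon):
    """Fraction of the variance of x accounted for by the reconstruction."""
    return 1.0 - np.sum((x - recon) ** 2) / np.sum((x - x.mean(axis=0)) ** 2)

# Toy 2-D activations: the primary SAE only represents the first coordinate,
# and a hand-set secondary SAE represents the second one.
x = np.array([[1.0, 2.0], [5.0, 2.5], [9.0, 3.0]])
base = (np.array([[1.0], [0.0]]), np.zeros(1), np.array([[1.0, 0.0]]), np.zeros(2))
residual = (np.array([[0.0], [1.0]]), np.zeros(1), np.array([[0.0, 1.0]]), np.zeros(2))

_, base_only = sae_forward(x, *base)
boosted = boosted_reconstruction(x, base, residual)
ev_base = explained_variance(x, base_only)
ev_boosted = explained_variance(x, boosted)
```

In this toy setup the secondary SAE recovers exactly the part of the activation that the primary SAE drops, so the summed reconstruction explains more variance than the primary SAE alone. In practice the secondary SAE's weights would be obtained by minimizing `residual_loss` on domain-specific activations while the primary SAE stays frozen.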
