
Teach Old SAEs New Domain Tricks with Boosting

July 17, 2025
作者: Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
cs.AI

Abstract

Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
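The residual scheme the abstract describes can be sketched in a few lines: a secondary SAE is trained to reconstruct the primary SAE's error on domain-specific activations, and the two reconstructions are summed at inference. The toy `SparseAutoencoder` class below, its dimensions, and its random weights are illustrative assumptions for exposition, not the paper's actual implementation:

```python
import numpy as np

class SparseAutoencoder:
    """Toy SAE: linear encoder with ReLU sparsity, linear decoder."""
    def __init__(self, d_model, d_hidden, seed):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0, 0.02, (d_model, d_hidden))
        self.W_dec = rng.normal(0, 0.02, (d_hidden, d_model))
        self.b_enc = np.zeros(d_hidden)

    def __call__(self, x):
        # ReLU gives a sparse, non-negative feature code
        feats = np.maximum(x @ self.W_enc + self.b_enc, 0.0)
        return feats @ self.W_dec  # reconstruction of x

d_model = 16
primary = SparseAutoencoder(d_model, 64, seed=0)    # pretrained, kept frozen
secondary = SparseAutoencoder(d_model, 64, seed=1)  # fit on the residuals

# stand-in for LLM activations on domain-specific text
x = np.random.default_rng(2).normal(size=(4, d_model))

# training target for the secondary SAE: the primary's reconstruction error
residual_target = x - primary(x)

# at inference, the two models' outputs are simply summed
combined = primary(x) + secondary(x)
```

Because the secondary SAE only has to explain what the primary one misses, the primary model's general-domain features (and their interpretations) are left untouched.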