OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features
September 26, 2025
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
cs.AI
Abstract
Sparse autoencoders (SAEs) are a technique for sparsely decomposing neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating holes in the representation, and from feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that mitigates these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and feature composition (by 15%), improves performance on spurious-correlation removal (+6%), and performs on par with traditional SAEs on other downstream tasks.
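To make the core mechanism concrete, below is a minimal PyTorch sketch of a pairwise-cosine-similarity penalty on SAE decoder directions. The random partitioning into blocks, the block_size value, and the clamping to positive similarities are illustrative assumptions chosen so the per-step cost stays linear in the dictionary size; this is not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def orthogonality_penalty(W_dec: torch.Tensor, block_size: int = 2048) -> torch.Tensor:
    """Mean squared positive cosine similarity between pairs of SAE features.

    W_dec: (num_features, d_model) decoder matrix; each row is one feature
    direction. Computing all num_features**2 pairs per step would be
    quadratic, so this sketch randomly partitions the features into blocks
    and penalizes pairs within each block, keeping the per-step cost
    O(num_features * block_size * d_model), i.e. linear in dictionary size.
    """
    n = W_dec.shape[0]
    perm = torch.randperm(n, device=W_dec.device)
    penalty = W_dec.new_zeros(())
    num_pairs = 0
    for start in range(0, n, block_size):
        idx = perm[start:start + block_size]
        W = F.normalize(W_dec[idx], dim=1)   # unit-norm feature directions
        sims = W @ W.T                       # pairwise cosine similarities
        sims.fill_diagonal_(0.0)             # drop self-similarity
        # Penalizing only positive alignment is an assumption; squaring
        # concentrates the penalty on highly similar pairs.
        penalty = penalty + (sims.clamp(min=0) ** 2).sum()
        k = idx.shape[0]
        num_pairs += k * (k - 1)
    return penalty / max(num_pairs, 1)
```

In training, such a term would typically be added to the usual SAE objective, e.g. loss = reconstruction_loss + sparsity_loss + lambda_ort * orthogonality_penalty(sae.W_dec), where lambda_ort is a hypothetical weighting hyperparameter trading off reconstruction quality against feature orthogonality.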