OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

September 26, 2025
Authors: Anton Korznikov, Andrey Galichin, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, Ivan Oseledets
cs.AI

Abstract

Sparse autoencoders (SAEs) are a technique for sparsely decomposing neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features, creating representation holes, and from feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach aimed at mitigating these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption (by 65%) and feature composition (by 15%), improves performance on spurious correlation removal (+6%), and achieves on-par performance with traditional SAEs on other downstream tasks.
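The core idea in the abstract, penalizing high pairwise cosine similarity between SAE decoder features, can be illustrated with a short NumPy sketch. This is not the paper's implementation: the function name, the threshold parameter, and the mean-over-pairs reduction are assumptions for illustration. Note also that the naive all-pairs computation shown here is quadratic in the number of features; the paper reports a procedure that scales linearly with SAE size, which this sketch does not reproduce.

```python
import numpy as np

def orthogonality_penalty(W_dec, threshold=0.0):
    """Illustrative sketch of an orthogonality penalty on SAE features.

    W_dec: (num_features, d_model) decoder matrix; each row is one
    feature's direction in activation space. `threshold` (an assumed
    hyperparameter) lets small similarities go unpenalized.
    """
    # Normalize each feature direction to unit norm so that dot
    # products become cosine similarities.
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    U = W_dec / np.clip(norms, 1e-8, None)

    # All pairwise cosine similarities (num_features x num_features).
    sims = U @ U.T

    # Ignore self-similarity on the diagonal; penalize only the amount
    # by which |cos| exceeds the threshold.
    np.fill_diagonal(sims, 0.0)
    excess = np.maximum(np.abs(sims) - threshold, 0.0)

    # Average over the off-diagonal pairs.
    n = W_dec.shape[0]
    return excess.sum() / (n * (n - 1))
```

With perfectly orthogonal features (e.g. `np.eye(3)`) the penalty is 0, while duplicated feature directions drive it toward 1, so adding this term to the SAE training loss pushes features apart, in the spirit of the training procedure the abstract describes.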
October 6, 2025