OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

Abstract

Sparse autoencoders (SAEs) are a technique for sparsely decomposing neural network activations into human-interpretable features. However, current SAEs suffer from feature absorption, where specialized features capture instances of general features and thereby create representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that aims to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption by 65% and feature composition by 15%, improves performance on spurious correlation removal by 6%, and achieves on-par performance with traditional SAEs on other downstream tasks.
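To make the idea of penalizing pairwise cosine similarity concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not say how linear scaling is obtained, so the sketch assumes one possible route: features are shuffled into fixed-size chunks and only pairs within the same chunk are compared. The function names, the chunking scheme, the squared-similarity form of the penalty, and the toy vectors are all hypothetical, as is the choice of which vectors count as "SAE features" (for example, decoder directions).

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def chunked_orthogonality_penalty(features, chunk_size, rng):
    """Mean squared cosine similarity over within-chunk feature pairs.

    Assumption (not stated in the abstract): features are randomly
    partitioned into chunks of a fixed size and only pairs inside a
    chunk are compared, so for a fixed chunk_size the number of pairs
    grows linearly with the number of features.
    """
    order = list(range(len(features)))
    rng.shuffle(order)
    total, pairs = 0.0, 0
    for start in range(0, len(order), chunk_size):
        chunk = order[start:start + chunk_size]
        for i in range(len(chunk)):
            for j in range(i + 1, len(chunk)):
                sim = cosine(features[chunk[i]], features[chunk[j]])
                total += sim * sim
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy feature sets (hypothetical): mutually orthogonal vs. nearly parallel.
orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
redundant = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.05, 0.0]]

rng = random.Random(0)
penalty_orthogonal = chunked_orthogonality_penalty(orthogonal, 3, rng)
penalty_redundant = chunked_orthogonality_penalty(redundant, 3, rng)
```

In a training loop, a term of this kind would presumably be added to the usual reconstruction and sparsity objective with a weighting coefficient. That coefficient and the exact form of the penalty are not given in the abstract and would need to be taken from the paper itself.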
