OrtSAE: Orthogonal Sparse Autoencoders Uncover Atomic Features

Abstract

Sparse autoencoders (SAEs) are a technique for sparsely decomposing neural network activations into human-interpretable features. However, current SAEs suffer from two problems: feature absorption, where specialized features capture instances of general features and thereby create representation holes, and feature composition, where independent features merge into composite representations. In this work, we introduce Orthogonal SAE (OrtSAE), a novel approach that aims to mitigate these issues by enforcing orthogonality between the learned features. By implementing a new training procedure that penalizes high pairwise cosine similarity between SAE features, OrtSAE promotes the development of disentangled features while scaling linearly with the SAE size, avoiding significant computational overhead. We train OrtSAE across different models and layers and compare it with other methods. We find that OrtSAE discovers 9% more distinct features, reduces feature absorption by 65% and feature composition by 15%, improves performance on spurious correlation removal by 6%, and achieves on-par performance on other downstream tasks compared to traditional SAEs.
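
The abstract does not spell out the penalty itself, so the following is only a minimal sketch of how a pairwise cosine-similarity penalty could be kept linear in the number of features. It assumes (this is not stated above) that features are shuffled into fixed-size chunks, that similarities are computed only within a chunk, and that each feature contributes the square of its largest positive similarity. The names `cosine` and `chunked_orthogonality_penalty` and the parameter `chunk_size` are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def chunked_orthogonality_penalty(features, chunk_size, rng):
    """Mean squared worst-case similarity within random chunks.

    Assumption (not from the source): comparing each feature only to
    the others in its chunk costs about len(features) * chunk_size
    similarity evaluations, which is linear in the number of features
    for a fixed chunk_size.
    """
    if not features:
        return 0.0
    order = list(range(len(features)))
    rng.shuffle(order)  # random chunk assignment for this step
    total = 0.0
    for start in range(0, len(order), chunk_size):
        chunk = order[start:start + chunk_size]
        for i in chunk:
            sims = [cosine(features[i], features[j]) for j in chunk if j != i]
            worst = max(sims, default=0.0)
            # Assumption: only positive (high) similarity is penalized.
            total += max(worst, 0.0) ** 2
    return total / len(features)


orthogonal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
duplicated = [[1.0, 0.0], [1.0, 0.0]]
penalty_orthogonal = chunked_orthogonality_penalty(orthogonal, 3, random.Random(0))
penalty_duplicated = chunked_orthogonality_penalty(duplicated, 2, random.Random(0))
```

In this sketch, mutually orthogonal feature vectors incur no penalty, while two identical features in the same chunk incur the maximum. In an actual training loop such a term would presumably be added to the reconstruction loss with a weighting coefficient, but that coefficient and the exact form of the loss are not given in the abstract.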
