Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Abstract

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable, the decomposition must be sparse. These two objectives are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show, through manual and automated interpretability studies, that this improvement does not come at the cost of interpretability. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs, in which the ReLU is replaced with a discontinuous JumpReLU activation function, and are similarly efficient to train and run. By utilising straight-through estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
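To make the activation function concrete, here is a minimal sketch in plain Python. It assumes the usual definition JumpReLU_θ(z) = z · H(z − θ), where H is the Heaviside step function and θ is a positive per-feature threshold. The function names `relu`, `jumprelu`, and `l0`, and the sample values, are illustrative and are not taken from the abstract. Only the forward pass is shown, not the STE-based training.

```python
def relu(z):
    """Standard ReLU: negative pre-activations become zero."""
    return [max(0.0, v) for v in z]


def jumprelu(z, theta):
    """JumpReLU: keep v unchanged if it exceeds its threshold, else output 0.

    Assumed definition: v * H(v - t), with H the Heaviside step function.
    """
    return [v if v > t else 0.0 for v, t in zip(z, theta)]


def l0(f):
    """L0 'norm': the number of non-zero (active) features."""
    return sum(1 for v in f if v != 0.0)


pre_acts = [-0.5, 0.2, 0.8, 1.5]   # hypothetical encoder pre-activations
theta = [0.5, 0.5, 0.5, 0.5]       # hypothetical per-feature thresholds

relu_out = relu(pre_acts)              # [0.0, 0.2, 0.8, 1.5]
jump_out = jumprelu(pre_acts, theta)   # [0.0, 0.0, 0.8, 1.5]
print(l0(relu_out), l0(jump_out))      # prints "3 2"
```

Pre-activations at or below the threshold are zeroed, while those above it pass through unchanged. The jump at the threshold is the discontinuity that, according to the abstract, STEs are used to train through.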

