Train Sparse Autoencoders Efficiently by Utilizing Features Correlation

Abstract

Sparse Autoencoders (SAEs) have shown significant promise for interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially with large dictionary sizes. While decoders can use sparsity-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function that approximates the binary AND operation and improves interpretability and performance in our factorized framework.
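To make the factorization idea concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes two small hypothetical projections `W_a` and `W_b` whose outputs are combined pairwise, so that m + n pre-activations produce m * n latents. The gate `soft_and` is an illustrative stand-in for an AND-like activation; the abstract does not give the exact form of mAND, so this choice is an assumption.

```python
import numpy as np

def soft_and(u, v):
    """Hypothetical AND-like gate (an assumption, not necessarily mAND):
    nonzero only when both inputs are positive."""
    return np.sqrt(np.maximum(u, 0.0) * np.maximum(v, 0.0))

def factorized_encode(x, W_a, W_b):
    """Encode x into m * n latents from two small projections."""
    a = x @ W_a  # shape (m,)
    b = x @ W_b  # shape (n,)
    # latent[i * n + j] = soft_and(a[i], b[j]), the same ordering as np.kron(a, b)
    return soft_and(a[:, None], b[None, :]).reshape(-1)

rng = np.random.default_rng(0)
d, m, n = 16, 4, 8  # illustrative sizes, chosen arbitrarily
x = rng.normal(size=d)
W_a = rng.normal(size=(d, m))
W_b = rng.normal(size=(d, n))

z = factorized_encode(x, W_a, W_b)

# Encoder parameter counts: one dense map vs. two small factors
dense_params = d * m * n
factored_params = d * (m + n)
```

With these sizes the two factors hold 192 encoder weights where a dense encoder with the same 32 latents would hold 512, and the gap widens as the dictionary grows because m + n grows far more slowly than m * n.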
