Train Sparse Autoencoders Efficiently by Utilizing Features Correlation

May 28, 2025
Authors: Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky
cs.AI

Abstract

Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
