Train Sparse Autoencoders Efficiently by Utilizing Features Correlation
May 28, 2025
Authors: Vadim Kurochkin, Yaroslav Aksenov, Daniil Laptev, Daniil Gavrilov, Nikita Balagansky
cs.AI
Abstract
Sparse Autoencoders (SAEs) have demonstrated significant promise in
interpreting the hidden states of language models by decomposing them into
interpretable latent directions. However, training SAEs at scale remains
challenging, especially when large dictionary sizes are used. While decoders
can leverage sparse-aware kernels for efficiency, encoders still require
computationally intensive linear operations with large output dimensions. To
address this, we propose KronSAE, a novel architecture that factorizes the
latent representation via Kronecker product decomposition, drastically reducing
memory and computational overhead. Furthermore, we introduce mAND, a
differentiable activation function approximating the binary AND operation,
which improves interpretability and performance in our factorized framework.
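The core idea of the abstract — replacing one large encoder projection with two smaller ones whose outputs are combined pairwise through an AND-like nonlinearity — can be sketched roughly as follows. This is a minimal NumPy sketch, not the paper's exact formulation: the dimension names, the outer-product pairing, and the product-of-ReLUs surrogate for mAND are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m1, m2 = 16, 8, 4  # hidden size; latent-dimension factors (dictionary size m = m1 * m2)

# Two small encoder heads instead of one full (m1 * m2, d) encoder matrix.
W1 = rng.standard_normal((m1, d)) / np.sqrt(d)  # hypothetical head 1
W2 = rng.standard_normal((m2, d)) / np.sqrt(d)  # hypothetical head 2

def m_and(a, b):
    # Assumed differentiable AND surrogate: the product of ReLU'd
    # pre-activations is nonzero only when both inputs are positive.
    return np.maximum(a, 0) * np.maximum(b, 0)

def encode(x):
    u = W1 @ x  # (m1,)
    v = W2 @ x  # (m2,)
    # Outer product enumerates all m1 * m2 pairs, giving a Kronecker-structured
    # latent without ever materializing a full (m1 * m2, d) weight matrix.
    return m_and(u[:, None], v[None, :]).reshape(-1)  # (m1 * m2,)

x = rng.standard_normal(d)
z = encode(x)
print(z.shape)  # → (32,)
```

The saving is easy to see in parameter counts: the factorized encoder uses `(m1 + m2) * d` weights versus `m1 * m2 * d` for a dense encoder, a gap that grows with dictionary size.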