Train Sparse Autoencoders Efficiently by Utilizing Features Correlation

Abstract

Sparse Autoencoders (SAEs) have demonstrated significant promise in interpreting the hidden states of language models by decomposing them into interpretable latent directions. However, training SAEs at scale remains challenging, especially when large dictionary sizes are used. While decoders can leverage sparse-aware kernels for efficiency, encoders still require computationally intensive linear operations with large output dimensions. To address this, we propose KronSAE, a novel architecture that factorizes the latent representation via Kronecker product decomposition, drastically reducing memory and computational overhead. Furthermore, we introduce mAND, a differentiable activation function approximating the binary AND operation, which improves interpretability and performance in our factorized framework.
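To make the factorization idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: it assumes a single pair of small encoders whose outputs are combined pairwise in Kronecker order, and it uses a hypothetical `soft_and` function as a stand-in for mAND, since the abstract only says that mAND approximates a binary AND. All names (`soft_and`, `factorized_latents`, `encoder_param_counts`) and the sizes `d`, `m`, `n` are made up for the example.

```python
import math
import random


def soft_and(u, v):
    # Hypothetical stand-in for mAND (an assumption, not the paper's definition):
    # nonzero only when both inputs are positive, like a soft logical AND.
    return math.sqrt(u * v) if u > 0 and v > 0 else 0.0


def linear(weights, x):
    # Plain matrix-vector product; `weights` is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def factorized_latents(x, w_a, w_b):
    # Two small encoders produce m and n pre-activations.
    a = linear(w_a, x)
    b = linear(w_b, x)
    # Combine every pair (i, j) in Kronecker order: index i * n + j.
    return [soft_and(a_i, b_j) for a_i in a for b_j in b]


def encoder_param_counts(d, m, n):
    # Dense encoder maps d inputs to m * n latents directly;
    # the factorized one only needs two maps, d -> m and d -> n.
    return d * m * n, d * (m + n)


random.seed(0)
d, m, n = 8, 4, 4
w_a = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
w_b = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
x = [random.gauss(0, 1) for _ in range(d)]

z = factorized_latents(x, w_a, w_b)
dense_params, factorized_params = encoder_param_counts(d, m, n)
```

In this toy setting the encoder produces `m * n` latents while storing only `d * (m + n)` encoder weights instead of `d * m * n`, which is the kind of saving the abstract attributes to the Kronecker factorization. A latent is active only when both of its factor pre-activations are positive, which is the AND-like behavior the sketch assumes.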
