Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
August 9, 2024
作者: Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
cs.AI
Abstract
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly
interpretable features. Despite recent excitement about their potential,
research applications outside of industry are limited by the high cost of
training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope,
an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2
2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs
on the Gemma 2 pre-trained models, but additionally release SAEs trained on
instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each
SAE on standard metrics and release these results. We hope that by releasing
these SAE weights, we can help make more ambitious safety and interpretability
research easier for the community. Weights and a tutorial can be found at
https://huggingface.co/google/gemma-scope and an interactive demo can be found
at https://www.neuronpedia.org/gemma-scope.
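For context, the JumpReLU architecture used throughout Gemma Scope replaces the standard ReLU in an SAE with a thresholded variant: a latent fires only if its pre-activation exceeds a learned per-latent threshold, and is zeroed otherwise. Below is a minimal PyTorch sketch of such an encoder/decoder pair. Parameter names (W_enc, threshold, etc.) and the usage dimensions are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn


class JumpReLUSAE(nn.Module):
    """Sketch of a JumpReLU sparse autoencoder.

    A latent fires only when its pre-activation exceeds a learned
    per-latent threshold; sub-threshold pre-activations are zeroed.
    """

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
        self.threshold = nn.Parameter(torch.zeros(d_sae))  # learned, one per latent
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = x @ self.W_enc + self.b_enc
        # JumpReLU: keep a pre-activation only if it clears its threshold.
        return torch.relu(pre_acts) * (pre_acts > self.threshold)

    def decode(self, acts: torch.Tensor) -> torch.Tensor:
        return acts @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))


# Hypothetical usage on residual-stream activations of width 2304
# (the hidden size of Gemma 2 2B), with a 16k-latent SAE:
sae = JumpReLUSAE(d_model=2304, d_sae=16384)
x = torch.randn(8, 2304)
recon = sae(x)
l0 = (sae.encode(x) > 0).float().sum(-1).mean()  # average active latents per input
```

The `l0` line above illustrates one of the standard metrics the abstract alludes to: SAE quality is typically reported as a trade-off between sparsity (average L0 of the latent activations) and reconstruction fidelity (for example, the increase in language-model loss when the reconstruction is spliced back in).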