Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
August 9, 2024
作者: Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
cs.AI
Abstract
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly
interpretable features. Despite recent excitement about their potential,
research applications outside of industry are limited by the high cost of
training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope,
an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2
2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs
on the Gemma 2 pre-trained models, but additionally release SAEs trained on
instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each
SAE on standard metrics and release these results. We hope that by releasing
these SAE weights, we can help make more ambitious safety and interpretability
research easier for the community. Weights and a tutorial can be found at
https://huggingface.co/google/gemma-scope and an interactive demo can be found
at https://www.neuronpedia.org/gemma-scope.
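For context, the JumpReLU architecture used throughout Gemma Scope replaces the standard ReLU in an SAE with a thresholded variant: a latent fires only if its pre-activation exceeds a learned per-latent threshold, and is zeroed otherwise. Below is a minimal PyTorch sketch of such an encoder/decoder pair. Parameter names (W_enc, threshold, etc.) and the usage dimensions are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn


class JumpReLUSAE(nn.Module):
    """Sketch of a JumpReLU sparse autoencoder.

    A latent fires only when its pre-activation exceeds a learned
    per-latent threshold; sub-threshold pre-activations are zeroed.
    """

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.zeros(d_sae, d_model))
        self.threshold = nn.Parameter(torch.zeros(d_sae))  # learned, one per latent
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = x @ self.W_enc + self.b_enc
        # JumpReLU: keep a pre-activation only if it clears its threshold.
        return torch.relu(pre_acts) * (pre_acts > self.threshold)

    def decode(self, acts: torch.Tensor) -> torch.Tensor:
        return acts @ self.W_dec + self.b_dec

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))


# Hypothetical usage on residual-stream activations of width 2304
# (the hidden size of Gemma 2 2B), with a 16k-latent SAE:
sae = JumpReLUSAE(d_model=2304, d_sae=16384)
x = torch.randn(8, 2304)
recon = sae(x)
l0 = (sae.encode(x) > 0).float().sum(-1).mean()  # average active latents per input
```

The `l0` line above illustrates one of the standard metrics the abstract alludes to: SAE quality is typically reported as a trade-off between sparsity (average L0 of the latent activations) and reconstruction fidelity (for example, the increase in language-model loss when the reconstruction is spliced back in).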