Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
August 9, 2024
Authors: Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, Neel Nanda
cs.AI
Abstract
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly
interpretable features. Despite recent excitement about their potential,
research applications outside of industry are limited by the high cost of
training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope,
an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2
2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs
on the Gemma 2 pre-trained models, but additionally release SAEs trained on
instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each
SAE on standard metrics and release these results. We hope that by releasing
these SAE weights, we can help make more ambitious safety and interpretability
research easier for the community. Weights and a tutorial can be found at
https://huggingface.co/google/gemma-scope and an interactive demo can be found
at https://www.neuronpedia.org/gemma-scope.
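
For readers who want to try the released SAEs, the sketch below shows a minimal JumpReLU SAE forward pass in NumPy, consistent with the description in the abstract: the encoder zeroes out any feature whose pre-activation falls below a learned per-feature threshold, and the decoder reconstructs the original activation from the resulting sparse feature vector. The HuggingFace repo id, file path, and parameter names (W_enc, W_dec, b_enc, b_dec, threshold) are assumptions based on the released tutorial's layout, not details given on this page.

```python
# Minimal JumpReLU SAE forward-pass sketch for the Gemma Scope release.
# Repo id, filename, and .npz parameter keys below are assumptions drawn
# from the released tutorial layout; adjust them to the actual files.
import numpy as np
from huggingface_hub import hf_hub_download

# Example: one SAE trained on the layer-20 residual stream of Gemma 2 2B.
path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",
    filename="layer_20/width_16k/average_l0_71/params.npz",
)
params = np.load(path)

def encode(x, params):
    """Map a model activation x of shape (d_model,) to sparse SAE features."""
    pre_acts = x @ params["W_enc"] + params["b_enc"]
    # JumpReLU: zero out features below their learned per-feature
    # threshold; pass the raw pre-activation through otherwise.
    return pre_acts * (pre_acts > params["threshold"])

def decode(f, params):
    """Reconstruct the original activation from sparse features f."""
    return f @ params["W_dec"] + params["b_dec"]

# Usage: x stands in for a residual-stream activation from Gemma 2 2B.
x = np.random.randn(params["W_enc"].shape[0]).astype(np.float32)
features = encode(x, params)           # mostly zeros (sparse)
reconstruction = decode(features, params)
```

In this decomposition, the nonzero entries of `features` are the "seemingly interpretable features" the abstract refers to, and the gap between `x` and `reconstruction` is one of the standard quality metrics the paper reports for each released SAE.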