Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Abstract

Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://www.neuronpedia.org/gemma-scope.
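
To make the idea of a sparse decomposition concrete, the following is a minimal NumPy sketch of a JumpReLU SAE forward pass. It is an illustration under stated assumptions, not the released implementation: the dimensions, the random weights, the threshold values, and the names `jumprelu` and `sae_forward` are all hypothetical, and it does not load the Gemma Scope weights.

```python
import numpy as np


def jumprelu(z, theta):
    """JumpReLU: pass a pre-activation through only where it exceeds its threshold."""
    return z * (z > theta)


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta):
    """Encode activations x into sparse features, then reconstruct x from them."""
    pre = x @ W_enc + b_enc      # (batch, d_sae) pre-activations
    f = jumprelu(pre, theta)     # sparse feature activations
    x_hat = f @ W_dec + b_dec    # (batch, d_model) reconstruction
    return f, x_hat


# Toy setup. All sizes and values below are assumptions chosen for illustration.
rng = np.random.default_rng(0)
d_model, d_sae, batch = 8, 32, 4
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)
theta = np.full(d_sae, 1.0)      # per-feature thresholds (learned in a real SAE)

x = rng.normal(size=(batch, d_model))  # stand-in for model activations
f, x_hat = sae_forward(x, W_enc, b_enc, W_dec, b_dec, theta)

l0 = (f != 0).sum(axis=-1)             # number of active features per input
mse = np.mean((x - x_hat) ** 2)        # reconstruction error (untrained, so large)
```

In a trained SAE the weights and thresholds are fit so that `x_hat` stays close to `x` while `l0` stays small. Here the weights are random, so only the shapes and the thresholding behaviour are meaningful.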
