CorrSteer: Steering Improves Task Performance and Safety in LLMs through Correlation-based Sparse Autoencoder Feature Selection

Abstract

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the need for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method improves task performance on QA, bias mitigation, jailbreak prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.
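The selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that SAE activations have already been pooled over the generated tokens into one value per sample and feature (the pooling rule is not stated here), that the correlation is a plain Pearson correlation, and that the steering coefficient is the mean activation over all samples. The function names `select_features` and `steering_coefficients` are hypothetical.

```python
import numpy as np


def select_features(acts, correct, top_k=1):
    """Rank SAE features by Pearson correlation with per-sample correctness.

    acts:    (n_samples, n_features) activations, assumed already pooled
             over generated tokens (pooling rule is an assumption).
    correct: (n_samples,) array of 0/1 correctness labels.
    Returns the indices of the top_k most positively correlated features
    and the full correlation vector.
    """
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=float)
    a_centered = acts - acts.mean(axis=0)
    y_centered = correct - correct.mean()
    cov = a_centered.T @ y_centered
    denom = np.linalg.norm(a_centered, axis=0) * np.linalg.norm(y_centered)
    # Features with zero variance get correlation 0 instead of NaN.
    corr = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    top = np.argsort(-corr)[:top_k]
    return top, corr


def steering_coefficients(acts, feature_ids):
    """Mean activation of each selected feature (assumed: over all samples)."""
    acts = np.asarray(acts, dtype=float)
    return acts[:, feature_ids].mean(axis=0)


if __name__ == "__main__":
    # Toy data: feature 0 fires on correct samples, feature 1 on incorrect
    # ones, and feature 2 is constant.
    acts = [[1, 0, 5], [2, 0, 5], [0, 1, 5], [0, 2, 5]]
    correct = [1, 1, 0, 0]
    top, corr = select_features(acts, correct, top_k=1)
    print(top, corr, steering_coefficients(acts, top))
```

In this toy setup the feature that fires on correct samples is ranked first, and its mean activation serves as the coefficient that would scale the corresponding SAE decoder direction during steering. How the coefficient is applied to the residual stream is outside the scope of this sketch.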
