The Rogue Scalpel: Activation Steering Compromises LLM Safety

Abstract

Activation steering is a promising technique for controlling LLM behavior by adding semantically meaningful vectors directly into a model's hidden states during inference. It is often framed as a precise, interpretable, and potentially safer alternative to fine-tuning. We demonstrate the opposite: steering systematically breaks model alignment safeguards, making it comply with harmful requests. Through extensive experiments on different model families, we show that even steering in a random direction can increase the probability of harmful compliance from 0% to 2-27%. Alarmingly, steering benign features from a sparse autoencoder (SAE), a common source of interpretable directions, increases these rates by a further 2-4%. Finally, we show that combining 20 randomly sampled vectors that jailbreak a single prompt creates a universal attack, significantly increasing harmful compliance on unseen requests. These results challenge the paradigm of safety through interpretability, showing that precise control over model internals does not guarantee precise control over model behavior.
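
The basic operation the abstract describes, adding a vector to a model's hidden states at inference time, can be illustrated with a toy sketch. This is not the authors' implementation: the function names `steer` and `random_unit_vector`, the scale `alpha`, and the toy dimensions are all assumptions made for illustration, and no real model is involved. This section of the paper does not specify which layer is steered or how the vector is scaled.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a random direction: draw Gaussian coordinates, then normalize to length 1."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def steer(hidden_states, direction, alpha):
    """Return hidden states with alpha * (unit-normalized direction) added at every token.

    hidden_states: list of per-token vectors (a toy stand-in for one layer's output).
    direction:     steering vector of the same width (hypothetical; could be random
                   or taken from an interpretable feature dictionary).
    alpha:         steering strength (an assumed free parameter).
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    return [[h + alpha * u for h, u in zip(token, unit)] for token in hidden_states]


rng = random.Random(0)
hidden = [[0.0] * 8 for _ in range(3)]      # toy activations: 3 tokens, width 8
direction = random_unit_vector(8, rng)
steered = steer(hidden, direction, alpha=4.0)
```

In a real model, the same addition would be applied to one layer's output during the forward pass. The sketch only shows that steering is a simple additive shift whose size is set by `alpha`, which is why the direction chosen and its strength fully determine the intervention.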
