SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Abstract

Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements on complex tasks. However, they pose significant safety risks when faced with harmful queries and adversarial attacks. Although supervised fine-tuning (SFT), the mainstream safety approach for LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After a thorough investigation of LRMs' generations, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, which includes two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head that enhances the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective that improves the model's attention to its query understanding, which carries important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.
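
The abstract does not spell out how Query-Mask Modeling hides the query, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a sequence laid out as query tokens, then query-understanding tokens, then key-sentence tokens, and it assumes the objective can be approximated by (a) an attention mask that stops key-sentence positions from seeing the query and (b) a loss mask that scores only the key sentence. The function names and the segment lengths are hypothetical.

```python
from typing import List


def query_mask_attention(n_query: int, n_understanding: int, n_key: int) -> List[List[bool]]:
    """Build a causal attention mask (True = position i may attend to j).

    Assumption: key-sentence positions are blocked from the query tokens,
    so they can only rely on the query-understanding tokens before them.
    """
    total = n_query + n_understanding + n_key
    key_start = n_query + n_understanding
    mask = []
    for i in range(total):  # i: attending position
        row = []
        for j in range(total):  # j: attended position
            causal = j <= i
            hidden_query = i >= key_start and j < n_query
            row.append(causal and not hidden_query)
        mask.append(row)
    return mask


def key_sentence_loss_mask(n_query: int, n_understanding: int, n_key: int) -> List[int]:
    """Return 1 where the language-modeling loss applies (key sentence only)."""
    return [0] * (n_query + n_understanding) + [1] * n_key


if __name__ == "__main__":
    # Hypothetical lengths: 2 query, 3 understanding, 2 key-sentence tokens.
    for row in query_mask_attention(2, 3, 2):
        print("".join("x" if allowed else "." for allowed in row))
    print(key_sentence_loss_mask(2, 3, 2))
```

In a real training loop, a boolean mask of this shape would be handed to the model's attention layers and the loss mask multiplied into the per-token loss; how SafeKey combines this objective with the Dual-Path Safety Head is not described in the abstract.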
