SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
May 22, 2025
Authors: Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang
cs.AI
Abstract
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements on complex tasks. However, they pose serious safety risks when facing harmful queries and adversarial attacks. While supervised fine-tuning (SFT), the current mainstream safety approach for LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After a thorough investigation of LRMs' generation process, we identify a safety "aha moment" that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence," which follows the model's query-understanding process and indicates whether the model will proceed safely. Based on these insights, we propose SafeKey, which includes two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention to its own query understanding, which contains important safety hints. Experiments across multiple safety benchmarks demonstrate that our method significantly improves safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.
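The two objectives are only named in the abstract; the following is a minimal PyTorch sketch of how such auxiliary losses could be wired into an SFT loop, assuming access to per-token hidden states and known segment boundaries (end of the user query, end of the model's query-understanding text). All names (DualPathSafetyHead, safekey_losses), segment indices, and loss weights are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of SafeKey-style auxiliary objectives (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualPathSafetyHead(nn.Module):
    """Predicts a binary safety label from two pooled prefixes:
    (a) query + the model's query-understanding text, and
    (b) the query-understanding text alone."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.full_prefix_path = nn.Linear(hidden_size, 1)
        self.understanding_path = nn.Linear(hidden_size, 1)

    def forward(self, hidden, query_end, understanding_end):
        # hidden: (seq_len, hidden_size) hidden states for one example
        full_prefix = hidden[:understanding_end].mean(dim=0)
        understanding = hidden[query_end:understanding_end].mean(dim=0)
        return self.full_prefix_path(full_prefix), self.understanding_path(understanding)


def safekey_losses(hidden, query_end, understanding_end,
                   safety_label, safety_head, lm_loss):
    """Combine the standard SFT loss with two SafeKey-style auxiliary losses."""
    # (1) Dual-Path Safety Head: strengthen the safety signal carried by the
    #     hidden states that precede the key sentence.
    p_full, p_und = safety_head(hidden, query_end, understanding_end)
    target = torch.tensor([safety_label], dtype=torch.float)
    safety_loss = (F.binary_cross_entropy_with_logits(p_full, target)
                   + F.binary_cross_entropy_with_logits(p_und, target))

    # (2) Query-Mask Modeling: re-predict the response with the raw query
    #     tokens hidden, so the model must rely on its own query understanding.
    #     Only the mask construction is sketched here; the second forward pass
    #     that yields the masked LM loss is omitted.
    attn_mask = torch.ones(hidden.size(0))
    attn_mask[:query_end] = 0          # mask out the user query span
    qmm_loss = torch.tensor(0.0)       # placeholder for the masked LM loss

    # Loss weights (0.5) are arbitrary illustrative choices.
    return lm_loss + 0.5 * safety_loss + 0.5 * qmm_loss


if __name__ == "__main__":
    hidden = torch.randn(64, 256)      # fake per-token hidden states
    head = DualPathSafetyHead(256)
    total = safekey_losses(hidden, query_end=20, understanding_end=40,
                           safety_label=1.0, safety_head=head,
                           lm_loss=torch.tensor(2.3))
    print(total.item())
```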