SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
May 22, 2025
Authors: Kaiwen Zhou, Xuandong Zhao, Gaowen Liu, Jayanth Srinivasa, Aosong Feng, Dawn Song, Xin Eric Wang
cs.AI
Abstract
Large Reasoning Models (LRMs) introduce a new generation paradigm of
explicitly reasoning before answering, leading to remarkable improvements in
complex tasks. However, they also pose serious safety risks when faced with
harmful queries and adversarial attacks. While supervised fine-tuning (SFT),
the mainstream safety approach for LRMs, improves safety performance, we find that
SFT-aligned models struggle to generalize to unseen jailbreak prompts. After
a thorough investigation of LRMs' generation process, we identify a safety aha moment
that can activate safety reasoning and lead to a safe response. This aha moment
typically appears in the 'key sentence', which follows the model's query
understanding process and can indicate whether the model will proceed safely.
Based on these insights, we propose SafeKey, including two complementary
objectives to better activate the safety aha moment in the key sentence: (1) a
Dual-Path Safety Head to enhance the safety signal in the model's internal
representations before the key sentence, and (2) a Query-Mask Modeling
objective to improve the model's attention to its query understanding, which
has important safety hints. Experiments across multiple safety benchmarks
demonstrate that our methods significantly improve safety generalization to a
wide range of jailbreak attacks and out-of-distribution harmful prompts,
lowering the average harmfulness rate by 9.6%, while maintaining general
abilities. Our analysis reveals how SafeKey enhances safety by reshaping
internal attention and improving the quality of hidden representations.
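The two training objectives described in the abstract can be illustrated with a minimal NumPy sketch. This is an illustrative assumption, not the paper's exact formulation: the function names, shapes, and loss forms are hypothetical. It models the Dual-Path Safety Head as a linear probe on hidden states preceding the key sentence, and Query-Mask Modeling as a language-modeling loss computed only over non-query positions.

```python
import numpy as np

def safety_head_loss(hidden, w, label):
    """Sketch of a Dual-Path Safety Head objective (hypothetical form):
    a linear probe on the hidden state before the key sentence, trained
    with binary cross-entropy to predict whether the query is harmful."""
    logit = hidden @ w                       # scalar safety logit
    p = 1.0 / (1.0 + np.exp(-logit))         # sigmoid probability of "harmful"
    eps = 1e-9                               # numerical stability
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

def query_mask_lm_loss(token_logprobs, query_mask):
    """Sketch of Query-Mask Modeling (hypothetical form): the raw query
    tokens are masked out, so the LM loss is averaged only over the
    remaining positions, pushing the model to rely on its own query
    understanding when generating the key sentence."""
    keep = ~query_mask                       # positions outside the query
    return -token_logprobs[keep].mean()      # mean negative log-likelihood

# Toy usage with random values (shapes are illustrative)
rng = np.random.default_rng(0)
hidden = rng.normal(size=8)                  # pooled hidden state
w = rng.normal(size=8)                       # probe weights
logprobs = np.log(np.full(6, 0.5))           # per-token log-probs
mask = np.array([True, True, True, False, False, False])
total = safety_head_loss(hidden, w, label=1) + query_mask_lm_loss(logprobs, mask)
```

In the paper's framing, both terms would be added to the standard SFT loss during training; the sketch above only shows how each auxiliary signal could be computed.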