Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

October 7, 2025
Authors: Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, Yun Xing, Xing Yu, Jinjin Gu
cs.AI

Abstract

Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that contribute negatively to refusal behavior. Ablating just 3% of these heads can reduce attack success rates to below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies the training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
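
To make the probing setup concrete, below is a minimal sketch of how a per-token refusal score could be computed with a linear probe over a reasoning model's hidden states. It is not the authors' released code: the checkpoint name is illustrative, and `probe_w`/`probe_b` are assumed to come from a separately trained probe (e.g., logistic regression on activations from refused vs. complied examples).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative LRM checkpoint; any multi-step reasoning model would do.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def refusal_scores(prompt: str, probe_w: torch.Tensor, probe_b: float,
                   layer: int = -1) -> torch.Tensor:
    """Return a refusal probability for every token position.

    probe_w (d_model,) and probe_b are assumed to be fit beforehand on
    residual-stream activations labeled refuse vs. comply (hypothetical
    training procedure, not reproduced here).
    """
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0].float()      # (seq_len, d_model)
    return torch.sigmoid(h @ probe_w + probe_b)  # (seq_len,) refusal prob

# A "refusal cliff" would show up as scores staying high across the thinking
# span and collapsing at the last few tokens before the answer, e.g.:
#   scores = refusal_scores(harmful_prompt, probe_w, probe_b)
#   cliff = scores[:-5].mean() - scores[-5:].mean()  # large positive gap
```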
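
The head-ablation intervention can likewise be sketched with forward pre-hooks that zero out individual heads' contributions. This assumes a Llama/Qwen-style block where `o_proj` consumes the concatenated per-head outputs; the specific head indices would come from the paper's causal intervention analysis, which this snippet does not reproduce.

```python
from typing import List, Tuple

def ablate_heads(model, heads: List[Tuple[int, int]]):
    """Zero the output of selected attention heads, given (layer, head) pairs.

    Slicing the input to o_proj isolates one head's contribution, so zeroing
    that slice ablates the head without touching the rest of the layer.
    """
    d_head = model.config.hidden_size // model.config.num_attention_heads
    by_layer: dict = {}
    for layer_idx, head_idx in heads:
        by_layer.setdefault(layer_idx, []).append(head_idx)

    handles = []
    for layer_idx, head_idxs in by_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj

        def pre_hook(module, args, head_idxs=head_idxs):
            (x,) = args                    # (batch, seq, n_heads * d_head)
            x = x.clone()
            for h in head_idxs:
                x[..., h * d_head : (h + 1) * d_head] = 0.0
            return (x,)

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles  # call handle.remove() on each to restore the model
```

Under this setup, re-running a jailbreak benchmark with the negatively contributing heads ablated is how one would check the abstract's claim that removing a sparse set of heads drives attack success rates down.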