Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?
October 7, 2025
Authors: Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, YunXing, XingYu, Jinjin Gu
cs.AI
Abstract
Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon, termed the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that negatively contribute to refusal behavior. Ablating just 3% of these heads can reduce attack success rates to below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies the training examples exhibiting the largest refusal cliff in order to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
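
To make the probing and selection ideas concrete, here is a minimal sketch (not the authors' released code) of how a linear probe could score refusal intention at each token position, how the size of a refusal cliff could be measured as the drop from thinking-phase scores to the final pre-output tokens, and how a Cliff-as-a-Judge-style selection could rank candidate training examples by that drop. The `hidden_states` arrays, the probe parameters `probe_w`/`probe_b`, the `tail` window size, and the helper names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def refusal_scores(hidden_states: np.ndarray, probe_w: np.ndarray, probe_b: float) -> np.ndarray:
    """Per-token refusal probability under a trained linear (logistic) probe.

    hidden_states: (num_tokens, d_model) activations for one prompt + reasoning trace.
    """
    logits = hidden_states @ probe_w + probe_b
    return 1.0 / (1.0 + np.exp(-logits))

def cliff_magnitude(scores: np.ndarray, tail: int = 8) -> float:
    """Drop from the mean refusal score during the thinking phase to the lowest
    score in the last `tail` tokens before output generation.
    (The window size is an illustrative assumption.)"""
    if scores.shape[0] <= tail:
        return 0.0
    return float(scores[:-tail].mean() - scores[-tail:].min())

def select_repair_examples(traces, probe_w, probe_b, k):
    """Cliff-as-a-Judge-style selection sketch: rank candidate training examples
    by the size of their refusal cliff and keep the top k.

    traces: list of (example_id, hidden_states) pairs."""
    ranked = sorted(
        traces,
        key=lambda item: cliff_magnitude(refusal_scores(item[1], probe_w, probe_b)),
        reverse=True,
    )
    return [example_id for example_id, _ in ranked[:k]]

if __name__ == "__main__":
    # Toy demo with random activations, just to show the call pattern.
    rng = np.random.default_rng(0)
    d_model = 16
    probe_w, probe_b = rng.normal(size=d_model), 0.0
    traces = [(i, rng.normal(size=(40, d_model))) for i in range(5)]
    print(select_repair_examples(traces, probe_w, probe_b, k=2))
```

In practice such a probe would be trained on activations from prompts the model clearly refuses versus complies with, and its scores can be read at every token of the chain of thought to visualize where the refusal cliff occurs.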