Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning?

October 7, 2025
Authors: Qingyu Yin, Chak Tou Leong, Linyi Yang, Wenxuan Huang, Wenjie Li, Xiting Wang, Jaehong Yoon, Yun Xing, Xing Yu, Jinjin Gu
cs.AI

Abstract

Large reasoning models (LRMs) with multi-step reasoning capabilities have shown remarkable problem-solving abilities, yet they exhibit concerning safety vulnerabilities that remain poorly understood. In this work, we investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens. Using a linear probing approach to trace refusal intentions across token positions, we discover a striking phenomenon termed the refusal cliff: many poorly aligned reasoning models correctly identify harmful prompts and maintain strong refusal intentions during their thinking process, but experience a sharp drop in refusal scores at the final tokens before output generation. This suggests that these models are not inherently unsafe; rather, their refusal intentions are systematically suppressed. Through causal intervention analysis, we identify a sparse set of attention heads that contribute negatively to refusal behavior. Ablating just 3% of these heads can reduce attack success rates to below 10%. Building on these mechanistic insights, we propose Cliff-as-a-Judge, a novel data selection method that identifies the training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment. This approach achieves comparable safety improvements using only 1.7% of the vanilla safety training data, demonstrating a less-is-more effect in safety alignment.
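
To make the probing setup concrete, below is a minimal sketch of how a per-token refusal score could be computed with a linear probe over a reasoning model's hidden states. It is not the authors' released code: the checkpoint name is illustrative, and `probe_w`/`probe_b` are assumed to come from a separately trained probe (e.g., logistic regression on activations from refused vs. complied examples).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative LRM checkpoint; any multi-step reasoning model would do.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def refusal_scores(prompt: str, probe_w: torch.Tensor, probe_b: float,
                   layer: int = -1) -> torch.Tensor:
    """Return a refusal probability for every token position.

    probe_w (d_model,) and probe_b are assumed to be fit beforehand on
    residual-stream activations labeled refuse vs. comply (hypothetical
    training procedure, not reproduced here).
    """
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    h = out.hidden_states[layer][0].float()      # (seq_len, d_model)
    return torch.sigmoid(h @ probe_w + probe_b)  # (seq_len,) refusal prob

# A "refusal cliff" would show up as scores staying high across the thinking
# span and collapsing at the last few tokens before the answer, e.g.:
#   scores = refusal_scores(harmful_prompt, probe_w, probe_b)
#   cliff = scores[:-5].mean() - scores[-5:].mean()  # large positive gap
```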
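
The head-ablation intervention can likewise be sketched with forward pre-hooks that zero out individual heads' contributions. This assumes a Llama/Qwen-style block where `o_proj` consumes the concatenated per-head outputs; the specific head indices would come from the paper's causal intervention analysis, which this snippet does not reproduce.

```python
from typing import List, Tuple

def ablate_heads(model, heads: List[Tuple[int, int]]):
    """Zero the output of selected attention heads, given (layer, head) pairs.

    Slicing the input to o_proj isolates one head's contribution, so zeroing
    that slice ablates the head without touching the rest of the layer.
    """
    d_head = model.config.hidden_size // model.config.num_attention_heads
    by_layer: dict = {}
    for layer_idx, head_idx in heads:
        by_layer.setdefault(layer_idx, []).append(head_idx)

    handles = []
    for layer_idx, head_idxs in by_layer.items():
        o_proj = model.model.layers[layer_idx].self_attn.o_proj

        def pre_hook(module, args, head_idxs=head_idxs):
            (x,) = args                    # (batch, seq, n_heads * d_head)
            x = x.clone()
            for h in head_idxs:
                x[..., h * d_head : (h + 1) * d_head] = 0.0
            return (x,)

        handles.append(o_proj.register_forward_pre_hook(pre_hook))
    return handles  # call handle.remove() on each to restore the model
```

Under this setup, re-running a jailbreak benchmark with the negatively contributing heads ablated is how one would check the abstract's claim that removing a sparse set of heads drives attack success rates down.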