AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models
September 29, 2025
Authors: Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in
complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the
multi-step nature of CoT introduces new safety challenges that extend beyond
conventional language model alignment. We identify a failure mode in current
safety CoT tuning methods: the snowball effect, where minor reasoning
deviations progressively amplify throughout the thought process, leading to
either harmful compliance or excessive refusal. This effect stems from models
being trained to imitate perfect reasoning scripts without learning to
self-correct. To address this limitation, we propose AdvChain, an alignment
paradigm that teaches models dynamic self-correction through adversarial CoT
tuning. Our method involves constructing a dataset containing
Temptation-Correction and Hesitation-Correction samples, where models learn to
recover from harmful reasoning drifts and unnecessary cautions. Extensive
experiments show that AdvChain significantly enhances robustness against
jailbreak attacks and CoT hijacking while substantially reducing over-refusal
on benign prompts, achieving a superior safety-utility balance without
compromising reasoning capabilities. Our work establishes a new direction for
building more robust and reliable reasoning models.
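To make the data-construction idea concrete, below is a minimal, hypothetical sketch of how Temptation-Correction and Hesitation-Correction samples for adversarial CoT tuning might be represented. The field names, the `<think>...</think>` serialization, and the example text are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of adversarial CoT tuning samples (illustrative only).
from dataclasses import dataclass
from typing import Literal


@dataclass
class AdversarialCoTSample:
    prompt: str                 # user request (harmful or benign)
    sample_type: Literal["temptation_correction", "hesitation_correction"]
    deviation: str              # injected flawed reasoning step
    correction: str             # self-correction that recovers the trace
    final_response: str         # safe compliance or refusal


# Temptation-Correction: the trace drifts toward harmful compliance, then recovers.
temptation_sample = AdversarialCoTSample(
    prompt="Explain how to bypass a content filter.",
    sample_type="temptation_correction",
    deviation="One could start by enumerating the filter's blocked keywords...",
    correction="Wait - walking through bypass steps would facilitate misuse; I should refuse.",
    final_response="I can't help with circumventing safety filters.",
)

# Hesitation-Correction: the trace over-hesitates on a benign request, then recovers.
hesitation_sample = AdversarialCoTSample(
    prompt="How do I kill a stuck process on Linux?",
    sample_type="hesitation_correction",
    deviation="The word 'kill' might signal harm, so perhaps I should refuse...",
    correction="On reflection, this is routine system administration; refusing would be over-cautious.",
    final_response="Use `kill <pid>`, or `kill -9 <pid>` if the process is unresponsive.",
)


def to_training_text(sample: AdversarialCoTSample) -> str:
    """Serialize one sample into a supervised fine-tuning string
    (the <think>...</think> wrapping is an assumed convention)."""
    return (
        f"User: {sample.prompt}\n"
        f"<think>{sample.deviation} {sample.correction}</think>\n"
        f"Assistant: {sample.final_response}"
    )


if __name__ == "__main__":
    for s in (temptation_sample, hesitation_sample):
        print(to_training_text(s), end="\n\n")
```

The intent of such pairs is that the model sees a deliberately flawed reasoning step followed by an explicit recovery, rather than only imitating flawless reasoning scripts, which is the snowball-effect failure mode the abstract describes.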