AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models
September 29, 2025
Authors: Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu
cs.AI
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in
complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the
multi-step nature of CoT introduces new safety challenges that extend beyond
conventional language model alignment. We identify a failure mode in current
safety CoT tuning methods: the snowball effect, where minor reasoning
deviations progressively amplify throughout the thought process, leading to
either harmful compliance or excessive refusal. This effect stems from models
being trained to imitate perfect reasoning scripts without learning to
self-correct. To address this limitation, we propose AdvChain, an alignment
paradigm that teaches models dynamic self-correction through adversarial CoT
tuning. Our method involves constructing a dataset containing
Temptation-Correction and Hesitation-Correction samples, where models learn to
recover from harmful reasoning drifts and unnecessary cautions. Extensive
experiments show that AdvChain significantly enhances robustness against
jailbreak attacks and CoT hijacking while substantially reducing over-refusal
on benign prompts, achieving a superior safety-utility balance without
compromising reasoning capabilities. Our work establishes a new direction for
building more robust and reliable reasoning models.
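To make the data-construction idea concrete, below is a minimal, hypothetical sketch of how Temptation-Correction and Hesitation-Correction samples for adversarial CoT tuning might be represented. The field names, the `<think>...</think>` serialization, and the example text are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of adversarial CoT tuning samples (illustrative only).
from dataclasses import dataclass
from typing import Literal


@dataclass
class AdversarialCoTSample:
    prompt: str                 # user request (harmful or benign)
    sample_type: Literal["temptation_correction", "hesitation_correction"]
    deviation: str              # injected flawed reasoning step
    correction: str             # self-correction that recovers the trace
    final_response: str         # safe compliance or refusal


# Temptation-Correction: the trace drifts toward harmful compliance, then recovers.
temptation_sample = AdversarialCoTSample(
    prompt="Explain how to bypass a content filter.",
    sample_type="temptation_correction",
    deviation="One could start by enumerating the filter's blocked keywords...",
    correction="Wait - walking through bypass steps would facilitate misuse; I should refuse.",
    final_response="I can't help with circumventing safety filters.",
)

# Hesitation-Correction: the trace over-hesitates on a benign request, then recovers.
hesitation_sample = AdversarialCoTSample(
    prompt="How do I kill a stuck process on Linux?",
    sample_type="hesitation_correction",
    deviation="The word 'kill' might signal harm, so perhaps I should refuse...",
    correction="On reflection, this is routine system administration; refusing would be over-cautious.",
    final_response="Use `kill <pid>`, or `kill -9 <pid>` if the process is unresponsive.",
)


def to_training_text(sample: AdversarialCoTSample) -> str:
    """Serialize one sample into a supervised fine-tuning string
    (the <think>...</think> wrapping is an assumed convention)."""
    return (
        f"User: {sample.prompt}\n"
        f"<think>{sample.deviation} {sample.correction}</think>\n"
        f"Assistant: {sample.final_response}"
    )


if __name__ == "__main__":
    for s in (temptation_sample, hesitation_sample):
        print(to_training_text(s), end="\n\n")
```

The intent of such pairs is that the model sees a deliberately flawed reasoning step followed by an explicit recovery, rather than only imitating flawless reasoning scripts, which is the snowball-effect failure mode the abstract describes.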