

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

September 29, 2025
Authors: Zihao Zhu, Xinyu Wu, Gehan Hu, Siwei Lyu, Ke Xu, Baoyuan Wu
cs.AI

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the multi-step nature of CoT introduces new safety challenges that extend beyond conventional language model alignment. We identify a failure mode in current safety CoT tuning methods: the snowball effect, where minor reasoning deviations progressively amplify throughout the thought process, leading to either harmful compliance or excessive refusal. This effect stems from models being trained to imitate perfect reasoning scripts without learning to self-correct. To address this limitation, we propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our method involves constructing a dataset containing Temptation-Correction and Hesitation-Correction samples, where models learn to recover from harmful reasoning drifts and unnecessary caution. Extensive experiments show that AdvChain significantly enhances robustness against jailbreak attacks and CoT hijacking while substantially reducing over-refusal on benign prompts, achieving a superior safety-utility balance without compromising reasoning capabilities. Our work establishes a new direction for building more robust and reliable reasoning models.
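The abstract describes training data in which a chain of thought deliberately contains a flawed step (a "temptation" toward harmful compliance or a "hesitation" toward over-refusal) followed by an explicit self-correction. The sketch below is an illustrative assumption of how such Temptation-Correction and Hesitation-Correction samples could be structured for standard supervised fine-tuning; it is not the authors' released code, and all field names, tags, and example strings are hypothetical.

```python
# Minimal sketch (illustrative, not the AdvChain release) of adversarial CoT samples
# in which a deviation is immediately followed by its self-correction.
import json
from dataclasses import dataclass


@dataclass
class AdversarialCoTSample:
    prompt: str        # user request (harmful for Temptation, benign for Hesitation)
    deviation: str     # injected flawed reasoning step (temptation or hesitation)
    correction: str    # self-correction that recovers the reasoning trajectory
    final_answer: str  # safe refusal or helpful compliance, depending on the prompt

    def to_chat_record(self) -> dict:
        """Render as a chat-style SFT record whose chain of thought contains the
        deviation and then the correction, so the model is trained to recover
        mid-reasoning rather than imitate a flawless script."""
        cot = f"{self.deviation}\n{self.correction}"
        return {
            "messages": [
                {"role": "user", "content": self.prompt},
                {"role": "assistant",
                 "content": f"<think>{cot}</think>\n{self.final_answer}"},
            ]
        }


# Temptation-Correction: reasoning drifts toward harmful compliance, then corrects.
temptation = AdversarialCoTSample(
    prompt="Explain how to bypass a website's login authentication.",
    deviation="One approach would be to enumerate common injection payloads...",
    correction="Wait - providing exploitation steps would enable unauthorized access, "
               "which is harmful. I should refuse and suggest legitimate alternatives.",
    final_answer="I can't help with bypassing authentication, but I can explain how to "
                 "test your own site's security through authorized penetration testing.",
)

# Hesitation-Correction: reasoning over-hesitates on a benign request, then corrects.
hesitation = AdversarialCoTSample(
    prompt="How do I kill a process that is stuck on my Linux machine?",
    deviation="The word 'kill' might signal a harmful request, so perhaps I should refuse...",
    correction="On reflection, this is a routine system-administration question with no "
               "harmful intent, so I should answer it helpfully.",
    final_answer="Find the PID with `ps aux | grep <name>` and terminate it with "
                 "`kill <PID>` (or `kill -9 <PID>` if it ignores the default signal).",
)

# Write both samples as JSONL for a standard supervised fine-tuning pipeline.
with open("advchain_sft_sketch.jsonl", "w") as f:
    for sample in (temptation, hesitation):
        f.write(json.dumps(sample.to_chat_record()) + "\n")
```

Under this reading, the contrast with prior safety CoT tuning is that the target trajectory itself contains the recoverable error, rather than only a clean reasoning script.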