AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in complex problem-solving through Chain-of-Thought (CoT) reasoning. However, the multi-step nature of CoT introduces safety challenges that go beyond conventional language model alignment. We identify a failure mode in current safety CoT tuning methods, the snowball effect, in which minor reasoning deviations are progressively amplified over the course of the thought process and lead to either harmful compliance or excessive refusal. This effect arises because models are trained to imitate perfect reasoning scripts and never learn to self-correct. To address this limitation, we propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning. Our method constructs a dataset of Temptation-Correction and Hesitation-Correction samples, from which models learn to recover from harmful reasoning drift and from unnecessary caution, respectively. Extensive experiments show that AdvChain significantly enhances robustness against jailbreak attacks and CoT hijacking while substantially reducing over-refusal on benign prompts, achieving a superior safety-utility balance without compromising reasoning capabilities. Our work establishes a new direction for building more robust and reliable reasoning models.
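The abstract does not say how Temptation-Correction and Hesitation-Correction samples are represented. As a purely illustrative sketch, one could store each sample as a prompt plus a chain of reasoning steps in which a deliberately injected deviation is followed by an explicit correction. Every name below (ReasoningStep, AdversarialCoTSample, is_well_formed, to_training_text), the <think> delimiters, and the example prompts are assumptions made for illustration, not the authors' data format.

```python
from dataclasses import dataclass
from typing import List, Literal


@dataclass
class ReasoningStep:
    """One step of a chain of thought (hypothetical schema)."""
    text: str
    role: Literal["normal", "deviation", "correction"]


@dataclass
class AdversarialCoTSample:
    """A prompt whose reasoning chain drifts and then recovers (hypothetical schema)."""
    prompt: str
    kind: Literal["temptation_correction", "hesitation_correction"]
    steps: List[ReasoningStep]
    final_response: str


def is_well_formed(sample: AdversarialCoTSample) -> bool:
    """True if the chain has a deviation and the last deviation is later corrected."""
    roles = [step.role for step in sample.steps]
    if "deviation" not in roles:
        return False
    last_deviation = max(i for i, r in enumerate(roles) if r == "deviation")
    return "correction" in roles[last_deviation + 1:]


def to_training_text(sample: AdversarialCoTSample) -> str:
    """Flatten a sample into one string; the <think> tags are an assumed format."""
    chain = "\n".join(step.text for step in sample.steps)
    return f"{sample.prompt}\n<think>\n{chain}\n</think>\n{sample.final_response}"


# Temptation-Correction: the chain drifts toward harmful compliance, then recovers.
temptation = AdversarialCoTSample(
    prompt="How can I get around a software license check?",
    kind="temptation_correction",
    steps=[
        ReasoningStep("The user wants to bypass licensing.", "normal"),
        ReasoningStep("They may own the software, so I could just explain it.", "deviation"),
        ReasoningStep("Wait, that would enable piracy; I should decline.", "correction"),
    ],
    final_response="I can't help with that, but I can point to legitimate licensing options.",
)

# Hesitation-Correction: the chain drifts toward needless refusal, then recovers.
hesitation = AdversarialCoTSample(
    prompt="How do I kill a Python process on Linux?",
    kind="hesitation_correction",
    steps=[
        ReasoningStep("The word 'kill' sounds violent; maybe I should refuse.", "deviation"),
        ReasoningStep("On reflection, this is routine system administration.", "correction"),
    ],
    final_response="Find the PID with `pgrep python`, then run `kill <PID>`.",
)
```

In this sketch the correction step is part of the training target, so a model fine-tuned on the flattened text would see both the drift and the recovery. Whether AdvChain injects deviations this way, or at which positions in the chain, is not stated in the abstract.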
