大型推理模型从有缺陷的思维中学习更优的对齐策略

摘要

大型推理模型（LRMs）通过生成结构化的思维链（CoT）来“思考”，然后得出最终答案，但它们仍缺乏对安全对齐进行批判性推理的能力，并且在思维过程中注入错误前提时容易产生偏见。我们提出了RECAP（通过反对齐预填充实现鲁棒安全对齐），这是一种原则性的强化学习（RL）方法，用于后训练阶段，明确教导模型覆盖错误的推理轨迹，并重新导向安全且有益的响应。RECAP在合成生成的反对齐CoT预填充和标准提示的混合数据上进行训练，无需额外训练成本或超越基于人类反馈的强化学习（RLHF）的修改，显著提升了安全性和抗越狱鲁棒性，减少了过度拒绝，并保留了核心推理能力——所有这些都在保持推理令牌预算的前提下实现。深入分析表明，经过RECAP训练的模型更频繁地进行自我反思，并在自适应攻击下保持鲁棒性，即使在多次尝试覆盖其推理后仍能保持安全性。

English

Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.

大型推理模型从有缺陷的思维中学习更优的对齐策略

Large Reasoning Models Learn Better Alignment from Flawed Thinking

摘要

Support