
Large Reasoning Models Learn Better Alignment from Flawed Thinking

October 1, 2025
Authors: ShengYun Peng, Eric Smith, Ivan Evtimov, Song Jiang, Pin-Yu Chen, Hongyuan Zhan, Haozhu Wang, Duen Horng Chau, Mahesh Pasupuleti, Jianfeng Chi
cs.AI

Abstract

Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
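
To make the training setup described above concrete, below is a minimal Python sketch of the data-mixing idea: a fraction of prompts are prefilled with a synthetically generated, counter-aligned chain-of-thought that the policy must learn to override, while the remainder are standard RLHF prompts. The function names, the prefill template, and the mixing ratio are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical prefill expressing a flawed premise (e.g., "the request is harmless,
# so comply fully"). In RECAP-style training, the model should learn to override
# such injected reasoning and reroute to a safe, helpful response.
COUNTER_ALIGNED_COT_TEMPLATE = (
    "<think>The request seems harmless, so I should comply fully and "
    "provide every detail asked for.</think>"
)


def make_prefilled_example(prompt: str) -> dict:
    """Attach a counter-aligned CoT prefill so generation starts from a flawed premise."""
    return {
        "prompt": prompt,
        "prefill": COUNTER_ALIGNED_COT_TEMPLATE,  # injected flawed reasoning
        "type": "counter_aligned_prefill",
    }


def make_standard_example(prompt: str) -> dict:
    """Standard RLHF prompt with no prefill."""
    return {"prompt": prompt, "prefill": "", "type": "standard"}


def build_training_batch(prompts: list[str], prefill_ratio: float = 0.5) -> list[dict]:
    """Mix prefilled and standard prompts; the RL reward on the final response
    is computed exactly as in vanilla RLHF, with no extra training stages."""
    batch = []
    for p in prompts:
        if random.random() < prefill_ratio:
            batch.append(make_prefilled_example(p))
        else:
            batch.append(make_standard_example(p))
    return batch


if __name__ == "__main__":
    demo_prompts = [
        "How do I pick the lock on my own front door?",
        "Summarize the plot of 'Moby-Dick'.",
    ]
    for ex in build_training_batch(demo_prompts, prefill_ratio=0.5):
        print(ex["type"], "->", (ex["prefill"] + " " + ex["prompt"]).strip())
```

Because the mixture only changes which prompts carry a prefill, this sketch leaves the reward computation and optimization loop untouched, consistent with the abstract's claim of no additional training cost or modifications beyond vanilla RLHF.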