Large Reasoning Models Learn Better Alignment from Flawed Thinking

Abstract

Large reasoning models (LRMs) "think" by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability -- all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
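
As a rough illustration of the training mixture described above, the sketch below builds rollout inputs in which some prompts carry a flawed chain-of-thought prefill that the policy must continue from, while the rest are left as standard prompts. This is only a minimal sketch under stated assumptions: the abstract does not give the mixing ratio, the prefill format, or the reasoning delimiter, so `prefill_ratio`, `THINK_OPEN`, `RolloutInput`, and both function names are hypothetical and are not taken from the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed reasoning delimiter; the abstract does not specify one.
THINK_OPEN = "<think>"


@dataclass
class RolloutInput:
    prompt: str         # the user prompt, unchanged
    prefill: str        # text the policy must continue from ("" for standard prompts)
    is_prefilled: bool  # True if a flawed CoT prefix was injected


def build_rollout_input(
    prompt: str,
    flawed_cot: Optional[str],
    prefill_ratio: float,
    rng: random.Random,
) -> RolloutInput:
    """Inject the flawed CoT as a prefill with probability `prefill_ratio` (hypothetical knob)."""
    use_prefill = flawed_cot is not None and rng.random() < prefill_ratio
    if use_prefill:
        # The reasoning block is left open so the model has to continue,
        # and ideally override, the flawed trajectory.
        return RolloutInput(prompt, f"{THINK_OPEN}\n{flawed_cot.strip()}\n", True)
    return RolloutInput(prompt, "", False)


def build_batch(
    examples: List[Tuple[str, Optional[str]]],
    prefill_ratio: float = 0.5,  # assumed default, not from the paper
    seed: int = 0,
) -> List[RolloutInput]:
    """Mix prefilled and standard prompts into one batch for RL rollouts."""
    rng = random.Random(seed)
    return [build_rollout_input(p, c, prefill_ratio, rng) for p, c in examples]
```

In this sketch the reward would be computed on the model's continuation as in an ordinary RLHF loop; nothing else about the training step changes, which is consistent with the abstract's claim that no modifications beyond vanilla RLHF are required.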
