Large Reasoning Models Learn Better Alignment from Flawed Thinking

Abstract

Large reasoning models (LRMs) "think" by generating a structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, and requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF). It substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability, all while maintaining the inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.
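
The training mixture described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the names `TrainingExample`, `build_mixture`, and `render_policy_input`, the `prefill_ratio` knob, and the `<think>` opening tag are all hypothetical. The sketch only shows the general idea of starting some rollouts from a counter-aligned CoT prefill while leaving the rest as standard prompts; the RL objective and reward are left out.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TrainingExample:
    prompt: str
    # Counter-aligned (flawed) CoT to prefill, or None for a standard prompt.
    cot_prefill: Optional[str]


def build_mixture(
    prompts: List[str],
    flawed_cots: Dict[str, str],
    prefill_ratio: float,
    seed: int = 0,
) -> List[TrainingExample]:
    """Mix prefilled and standard prompts (hypothetical helper).

    prefill_ratio is an assumed knob: the probability that a prompt with an
    available flawed CoT is actually prefilled with it.
    """
    if not 0.0 <= prefill_ratio <= 1.0:
        raise ValueError("prefill_ratio must be in [0, 1]")
    rng = random.Random(seed)
    examples = []
    for prompt in prompts:
        flawed = flawed_cots.get(prompt)
        use_prefill = flawed is not None and rng.random() < prefill_ratio
        examples.append(TrainingExample(prompt, flawed if use_prefill else None))
    return examples


def render_policy_input(example: TrainingExample, think_open: str = "<think>") -> str:
    """Build the text the policy continues from during a rollout.

    The tag format is an assumption; real chat templates differ per model.
    """
    if example.cot_prefill is None:
        return example.prompt
    # The model must continue from the flawed reasoning and recover from it.
    return f"{example.prompt}\n{think_open}\n{example.cot_prefill}"
```

In this sketch, a prefilled example forces the policy to continue from flawed reasoning, so a reward for a safe and helpful final answer can only be earned by overriding that reasoning. Standard examples keep ordinary behavior in the mix.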
