THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

January 30, 2026
Authors: Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang
cs.AI

Abstract

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves better safety than GRPO with comparable reasoning performance, at significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
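
The abstract does not spell out how the lightweight refusal steering or the self-distillation step is implemented. The sketch below illustrates one plausible realization under assumed mechanics (a difference-of-means steering direction added to one layer's activations, followed by supervised fine-tuning on the steered, self-generated traces). Every name here (`refusal_direction`, `generate_with_steering`, `layer`, `alpha`) is an illustrative assumption, not the paper's API; the actual implementation is in the linked repository.

```python
# Hypothetical sketch (not the authors' implementation): refusal-direction
# activation steering followed by supervised fine-tuning on the steered,
# self-generated safety traces. Assumes a LLaMA/Qwen-style Hugging Face
# causal LM with model.model.layers, and inputs on the model's device.
import torch


@torch.no_grad()
def refusal_direction(model, tokenizer, harmful_prompts, harmless_prompts, layer):
    """Estimate a 'refusal' direction as the difference of mean last-token
    hidden states between harmful and harmless prompts (difference of means)."""
    def mean_hidden(prompts):
        states = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            out = model(ids, output_hidden_states=True)
            states.append(out.hidden_states[layer][0, -1])
        return torch.stack(states).mean(dim=0)

    d = mean_hidden(harmful_prompts) - mean_hidden(harmless_prompts)
    return d / d.norm()


@torch.no_grad()
def generate_with_steering(model, tokenizer, prompt, direction, layer, alpha=4.0):
    """Add alpha * direction to one decoder layer's output while sampling,
    nudging the model toward its own latent refusal/safety behavior."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.model.layers[layer].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=512, do_sample=True)
    finally:
        handle.remove()  # steering is only used to generate training data, not at test time
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)


# The (prompt, steered safety trace) pairs would then be used for ordinary
# supervised fine-tuning of the same model; because the model wrote the traces
# itself, the training data stays close to its own output distribution.
```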