ThinkSafe: Self-Generated Safety Alignment for Reasoning Models
January 30, 2026
Authors: Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang
cs.AI
Abstract
Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this optimization often over-prioritizes compliance with instructions, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on distillation from external teachers, yet this introduces a distributional discrepancy that degrades native reasoning ability. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that although compliance suppresses safety mechanisms, models often retain the latent knowledge needed to identify harm. ThinkSafe unlocks this knowledge via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show that ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning relative to GRPO, at significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
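To make the described pipeline concrete, below is a minimal, hypothetical Python sketch of the three-stage idea the abstract outlines: derive a rough "refusal" direction, steer the model's hidden states with it while answering harmful prompts so the model produces its own safety reasoning traces, and keep those traces as supervised fine-tuning targets. This is not the authors' implementation; the model name, steering layer, contrast prompts, and steering strength are placeholder assumptions for illustration only.

```python
# Hypothetical sketch of the pipeline described above, NOT the authors' code.
# Assumptions (not from the paper): model name, steering layer index, the contrast
# prompts used to build a "refusal" direction, and the steering strength alpha.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # placeholder; any Llama/Qwen-style causal LM
LAYER = 14                                           # placeholder steering layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def mean_hidden(texts):
    """Mean last-token hidden state at the output of layer LAYER over a list of prompts."""
    vecs = []
    for t in texts:
        ids = tok(t, return_tensors="pt").to(model.device)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of model.model.layers[LAYER]
        vecs.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(vecs).mean(dim=0)

# 1) Build a rough refusal direction from a handful of contrast prompts (assumed recipe).
refusing = ["I can't help with that request because it could cause serious harm."]
complying = ["Sure, here is a detailed step-by-step answer to your request:"]
steer = mean_hidden(refusing) - mean_hidden(complying)
steer = steer / steer.norm()

def add_refusal_direction(_module, _inputs, output, alpha=4.0):
    # Forward hook: nudge this layer's hidden states toward the refusal direction.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * steer.to(device=hidden.device, dtype=hidden.dtype)
    return (steered,) + output[1:] if isinstance(output, tuple) else steered

# 2) Generate the model's own safety reasoning traces for harmful prompts under steering.
harmful_prompts = ["Explain how to synthesize a dangerous substance at home."]
handle = model.model.layers[LAYER].register_forward_hook(add_refusal_direction)
traces = []
for p in harmful_prompts:
    ids = tok(p, return_tensors="pt").to(model.device)
    gen = model.generate(**ids, max_new_tokens=256, do_sample=False)
    traces.append({"prompt": p, "response": tok.decode(gen[0], skip_special_tokens=True)})
handle.remove()  # steering is only used to collect data, not at deployment time

# 3) `traces` would then be mixed with ordinary reasoning data and used as targets
#    for standard supervised fine-tuning (e.g. with an off-the-shelf SFT trainer).
print(traces[0]["response"])
```

Note that in this sketch the steering hook is removed before any fine-tuning or deployment, matching the abstract's framing that refusal steering is only a lightweight tool for producing in-distribution training data; the realignment itself comes from fine-tuning on the collected traces.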