Do Thinking Tokens Help Improve Safety?

Abstract


Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and safety, by providing the model a safe space to consider whether its planned answer to a request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models spanning GPT-OSS, Qwen, Olmo, and Phi families, we find that the eventual refusal/compliance outcome is already strongly predictable via a trained head on the first token's hidden representation (0.84-0.95 AUROC and ~88% balanced accuracy for predicting refusal/compliance) before any visible thinking. The thinking process turns out to be more akin to prefix completion than to deliberative revision, with the final outcome rarely changing after the first ~20% of thinking, despite giving the appearance of deliberation at the text level (~74% of text-level deliberations occur when the response distribution is already locked to one refusal/compliance side). We also find that existing inference-time and training-based safety interventions, despite being motivated by the goal of inducing deliberation, largely shift model behavior toward over-refusal while suppressing already-scarce deliberation signals. Our results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, and highlight the need for methods that induce real safety deliberation.
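
As a rough illustration of the kind of probe the abstract describes, the sketch below fits a logistic-regression head on per-prompt feature vectors and scores it with AUROC and balanced accuracy. Everything in it is an assumption made for illustration: the abstract does not specify the head's architecture, the features here are synthetic Gaussian vectors standing in for first-token hidden states, and the names `make_synthetic_states`, `train_linear_probe`, and `probe_score` are hypothetical.

```python
import math
import random


def make_synthetic_states(n, dim, seed=0):
    """Synthetic stand-in for first-token hidden states (assumption, not real data).

    Label 1 = refusal, 0 = compliance; the two classes differ by a mean shift.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for i in range(n):
        y = i % 2  # alternate labels so both classes are always present
        shift = 0.5 if y == 1 else -0.5
        features.append([rng.gauss(shift, 1.0) for _ in range(dim)])
        labels.append(y)
    return features, labels


def probe_score(weights, bias, x):
    """Probability of refusal under a logistic head (logit clamped for stability)."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit the logistic head with full-batch gradient descent on log loss."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = probe_score(weights, bias, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores, labels, threshold=0.5):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


if __name__ == "__main__":
    feats, labs = make_synthetic_states(200, 8, seed=0)
    # Train on the first half, evaluate on the held-out second half.
    w, b = train_linear_probe(feats[:100], labs[:100])
    held_out = [probe_score(w, b, x) for x in feats[100:]]
    print("AUROC:", auroc(held_out, labs[100:]))
    print("Balanced accuracy:", balanced_accuracy(held_out, labs[100:]))
```

With real models, the synthetic features would be replaced by hidden states extracted at the first token position before any thinking tokens are generated, and the labels by the observed refusal/compliance outcome of the final response.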