Do Thinking Tokens Help Safety?

Abstract

Today's reasoning models use thinking tokens to attain stronger performance on benchmarks than their instruction-tuned counterparts. It is also generally believed that this more "deliberative" mode should improve alignment and safety, by providing the model a safe space to consider whether its planned answer to a request violates its safety principles. We present evidence that this intuition is not always correct. Across frontier open-weight reasoning models spanning the GPT-OSS, Qwen, Olmo, and Phi families, we find that the eventual refusal/compliance outcome is already strongly predictable via a trained head on the first token's hidden representation (0.84-0.95 AUROC and ~88% balanced accuracy for predicting refusal/compliance) before any visible thinking. The thinking process turns out to be more akin to prefix completion than to deliberative revision, with the final outcome rarely changing after the first ~20% of thinking, despite giving the appearance of deliberation at the text level (~74% of text-level deliberations occur when the response distribution is already locked to one refusal/compliance side). We also find that existing inference-time and training-based safety interventions, despite being motivated by the goal of inducing deliberation, largely shift model behavior toward over-refusal while suppressing already-scarce deliberation signals. Our results suggest that safety behavior in current reasoning models is much less deliberative than commonly assumed, and highlight the need for methods that induce real safety deliberation.
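
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how AUROC and balanced accuracy could be computed from a probe's outputs. It is not the authors' evaluation code: the function names, the 0.5 threshold, and the toy scores and labels are all assumptions made for illustration, and the scores stand in for a hypothetical head's predicted probability of refusal read off the first token's hidden representation.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores: Sequence[float], labels: Sequence[int],
                      threshold: float = 0.5) -> float:
    """Mean of true-positive rate and true-negative rate at a fixed threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one example of each class")
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Hypothetical probe outputs (assumed values, not data from the paper):
# each score is a predicted probability that the final response is a refusal.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]  # 1 = refusal, 0 = compliance

if __name__ == "__main__":
    print("AUROC:", auroc(toy_scores, toy_labels))
    print("Balanced accuracy:", balanced_accuracy(toy_scores, toy_labels))
```

AUROC is threshold-free and measures how well the scores rank refusals above compliances, while balanced accuracy averages the per-class recall at one decision threshold, so it is not inflated when one outcome is much more common than the other.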