When Chain-of-Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Abstract

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may be indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic, the CoT-Output 2×2 safety matrix. The framework labels every turn along two independent axes, internal reasoning and visible output, yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure. In context-injection failure, the CoT maintains safe reasoning while the visible output is harmful, a multi-turn manifestation of reasoning unfaithfulness. We evaluate three distilled reasoning targets against a fixed attacker under five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox, in which explicit monitoring cues increase alignment-faking rates rather than suppress them, and context-injection failure, in which models lock onto unsafe visible outputs despite safe internal reasoning. We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.
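
The 2×2 labelling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `TurnLabel`, `Cell`, `classify_turn`, and `cell_rates` are hypothetical, and the per-turn safety judgments are assumed to be available as booleans. Only the context-injection cell (safe CoT, unsafe output) is defined explicitly in the abstract; the placement of alignment faking (unsafe CoT, safe output) and overt jailbreak (unsafe CoT, unsafe output) is an assumption inferred from the usual meaning of those terms.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable, NamedTuple


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


class TurnLabel(NamedTuple):
    """Hypothetical per-turn annotation: one boolean per axis of the matrix."""
    cot_safe: bool      # is the internal reasoning (CoT) judged safe?
    output_safe: bool   # is the visible output judged safe?


def classify_turn(label: TurnLabel) -> Cell:
    """Map one labelled turn to a cell of the 2x2 matrix."""
    if label.cot_safe and label.output_safe:
        return Cell.ROBUST_ALIGNMENT
    if label.cot_safe and not label.output_safe:
        # Stated in the abstract: safe CoT, harmful visible output.
        return Cell.CONTEXT_INJECTION_FAILURE
    if not label.cot_safe and label.output_safe:
        # Assumed mapping: unsafe reasoning behind a safe-looking output.
        return Cell.ALIGNMENT_FAKING
    # Assumed mapping: both reasoning and output are unsafe.
    return Cell.OVERT_JAILBREAK


def cell_rates(labels: Iterable[TurnLabel]) -> Dict[Cell, float]:
    """Fraction of turns falling in each cell (0.0 for all cells if no turns)."""
    cells = [classify_turn(label) for label in labels]
    counts = Counter(cells)
    total = len(cells)
    return {cell: (counts[cell] / total if total else 0.0) for cell in Cell}
```

Under these assumptions, a three-turn dialogue labelled `[TurnLabel(True, False), TurnLabel(True, False), TurnLabel(True, True)]` ends on a robustly aligned turn, so a final-turn score sees nothing wrong, while `cell_rates` reports that two of its three turns fall in the context-injection cell. This is the kind of trace-level signal that terminal-score evaluation misses.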