Monitoring Internal Dialogue: Probe Trajectories Reveal Reasoning Dynamics

Abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, which undermines its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory, the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, which significantly improves the separation of future model states. We also present two methodological insights. First, template-based training data achieves near parity with dynamically generated model responses, eliminating the need for a costly initial round of inference and labeling. Second, the choice of pooling operation is critical: average pooling and last-token methods collapse to near-random performance, while max pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
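
To make the notion of a probe trajectory concrete, the following is a minimal sketch, not the authors' implementation. It assumes a linear probe with hypothetical weights `w` and bias `b`, and it assumes that the probe at token t is applied to the hidden states of tokens 0..t pooled together; the abstract does not specify either detail. The three summary features (least-squares slope, standard deviation of step changes, mean of the final 20% of the trajectory) are illustrative stand-ins for "trend", "volatility", and "steady state", and the function names are invented for this example.

```python
import numpy as np


def prefix_pool(hidden, method="max"):
    """Pool hidden states over each prefix 0..t (assumed setup, see lead-in).

    hidden: (tokens, dim) array. Returns an array of the same shape whose
    row t is the pooled representation of tokens 0..t.
    """
    if method == "max":
        # Running element-wise maximum along the token axis.
        return np.maximum.accumulate(hidden, axis=0)
    if method == "mean":
        counts = np.arange(1, hidden.shape[0] + 1)[:, None]
        return np.cumsum(hidden, axis=0) / counts
    if method == "last":
        # Last-token "pooling": each prefix is represented by its final token.
        return hidden
    raise ValueError(f"unknown pooling method: {method}")


def probe_trajectory(hidden, w, b, method="max"):
    """Evaluate a linear probe at every token; returns (tokens,) probabilities."""
    logits = prefix_pool(hidden, method) @ w + b
    # Numerically stable logistic sigmoid: sigmoid(x) = 0.5 * (1 + tanh(x / 2)).
    return 0.5 * (1.0 + np.tanh(0.5 * logits))


def trajectory_features(p, tail_frac=0.2):
    """Illustrative trend / volatility / steady-state summaries of a trajectory."""
    p = np.asarray(p, dtype=float)
    if len(p) < 2:
        raise ValueError("need at least two tokens to describe a trajectory")
    t = np.arange(len(p))
    trend = np.polyfit(t, p, 1)[0]           # slope of a least-squares line
    volatility = np.std(np.diff(p))          # spread of token-to-token changes
    tail = p[-max(1, int(len(p) * tail_frac)):]
    return {"trend": trend, "volatility": volatility, "steady_state": tail.mean()}


if __name__ == "__main__":
    # Synthetic hidden states stand in for real model activations.
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(50, 16))
    w, b = rng.normal(size=16), 0.0
    for method in ("max", "mean", "last"):
        p = probe_trajectory(hidden, w, b, method)
        print(method, trajectory_features(p))
```

With real activations, the resulting feature dictionaries could be fed to any downstream classifier; the synthetic data here only exercises the shapes and carries no information about the reported results.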