Monitoring the Inner Monologue: Probe Trajectories Reveal Reasoning Dynamics

Abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, which undermines its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory: the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling pass. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
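To make the notion of a probe trajectory concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a linear probe with a sigmoid output applied to each token's hidden state, and it uses three illustrative stand-ins for the feature families named above: mean absolute step change for volatility, a least-squares slope for trend, and the mean of the final quarter of the trajectory for steady-state behavior. The function names, the `tail_fraction` parameter, and these specific feature definitions are hypothetical; the abstract does not specify them.

```python
import math
from typing import Dict, List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def probe_trajectory(
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float,
) -> List[float]:
    """Apply one linear probe to every token's hidden state.

    Assumption: the probe is logistic-linear; the source does not
    state the probe architecture.
    """
    return [
        sigmoid(sum(w * h for w, h in zip(weights, state)) + bias)
        for state in hidden_states
    ]


def trajectory_features(
    traj: Sequence[float], tail_fraction: float = 0.25
) -> Dict[str, float]:
    """Summarize a trajectory with three illustrative features."""
    n = len(traj)
    if n < 2:
        raise ValueError("need at least two tokens to describe dynamics")

    # Volatility (assumed definition): mean absolute token-to-token change.
    volatility = sum(abs(b - a) for a, b in zip(traj, traj[1:])) / (n - 1)

    # Trend (assumed definition): least-squares slope against token index.
    t_mean = (n - 1) / 2.0
    p_mean = sum(traj) / n
    cov = sum((t - t_mean) * (p - p_mean) for t, p in enumerate(traj))
    var = sum((t - t_mean) ** 2 for t in range(n))
    trend = cov / var

    # Steady state (assumed definition): mean over the final tail.
    tail_len = max(1, int(n * tail_fraction))
    tail = traj[-tail_len:]
    steady_state = sum(tail) / len(tail)

    return {
        "volatility": volatility,
        "trend": trend,
        "steady_state": steady_state,
    }
```

In use, one would call `probe_trajectory` on the per-token hidden states of a single generation and pass the result to `trajectory_features`; the resulting feature vector could then feed a downstream classifier instead of a single static probe score.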