Monitoring the Inner Monologue: Probe Trajectories Reveal Reasoning Dynamics

Abstract

Large Reasoning Models (LRMs) introduce new opportunities for safety monitoring through their Chain of Thought (CoT) reasoning. However, CoT is not always faithful to the model's final output, undermining its reliability as a monitoring tool. To address this, we investigate the hidden representations of LRMs to determine whether future behavior can be predicted from prompt and CoT representations. By evaluating a probe at each generated token, we construct a probe trajectory: the continuous evolution of a concept's probability across the reasoning process. We find that future model behavior is more distinguishable when examined over the full trajectory than from a single static prediction. To characterize these temporal dynamics, we extract signal-processing features that capture volatility, trend, and steady-state behavior, significantly improving the separation of future model states. We also present two methodological insights. First, template-based training data achieves near-parity with dynamically generated model responses, eliminating the need for a costly initial inference and labeling step. Second, the choice of pooling operation is critical: average-pooling and last-token methods collapse to near-random performance, while max-pooling achieves up to 95% AUROC and yields stable probe trajectories. Using four datasets and four reasoning models across the domains of safety and mathematics, we demonstrate that trajectory features encode task-specific dynamics that improve outcome separability. These findings establish probe trajectories as a complementary framework for monitoring LRM behavior. Warning: This article contains potentially harmful content.
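
As a rough illustration of the idea, the following minimal sketch builds a probe trajectory and summarizes it with three features. It is not the authors' implementation. It assumes a linear (logistic) probe, interprets max-pooling as a running element-wise maximum over the tokens seen so far, and uses simple stand-ins for the features: the standard deviation of step-to-step changes for volatility, a least-squares slope for trend, and the mean of the final portion of the trajectory for steady-state behavior. The names `probe_trajectory`, `trajectory_features`, and `tail_frac` are hypothetical.

```python
import math
from statistics import mean, pstdev


def probe_trajectory(hidden_states, weights, bias):
    """Return one probe probability per token.

    hidden_states: list of per-token activation vectors (lists of floats).
    weights, bias: parameters of an assumed linear probe.
    Pooling is an assumed running element-wise max over the prefix.
    """
    running_max = list(hidden_states[0])
    probs = []
    for h in hidden_states:
        # Update the max-pooled representation with the current token.
        running_max = [max(m, x) for m, x in zip(running_max, h)]
        logit = sum(w * m for w, m in zip(weights, running_max)) + bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


def trajectory_features(probs, tail_frac=0.2):
    """Summarize a trajectory with illustrative stand-in features."""
    n = len(probs)
    # Volatility: spread of step-to-step changes in probability.
    diffs = [b - a for a, b in zip(probs, probs[1:])]
    volatility = pstdev(diffs) if diffs else 0.0
    # Trend: slope of a least-squares line fitted against token index.
    t_mean = (n - 1) / 2
    p_mean = mean(probs)
    denom = sum((t - t_mean) ** 2 for t in range(n))
    if denom:
        num = sum((t - t_mean) * (p - p_mean) for t, p in enumerate(probs))
        trend = num / denom
    else:
        trend = 0.0
    # Steady state: mean over the last tail_frac of the trajectory.
    tail = probs[-max(1, int(n * tail_frac)):]
    return {
        "volatility": volatility,
        "trend": trend,
        "steady_state": mean(tail),
    }


# Toy example with made-up 2-dimensional activations for three tokens.
states = [[0.0, 0.0], [1.0, -1.0], [0.5, 2.0]]
traj = probe_trajectory(states, weights=[1.0, 1.0], bias=0.0)
features = trajectory_features(traj)
```

A downstream classifier over such feature dictionaries, rather than over a single probe score, is one way to compare full-trajectory and static predictions under these assumptions.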