Forecasting Future Behavior as a Learning Task

Abstract

Trust in an AI system is often anchored by explanations of how it works, which one then uses to forecast its behavior on new inputs. For large reasoning models (LRMs), this conventional route is particularly difficult to follow: explanation methods designed for single-token generations do not naturally generalize to long trajectories, and the trajectories themselves are often not faithful when read as natural language. We propose an alternative that bypasses the explanation step: treat behavior forecasting as a learnable task and train Behavior Forecasters that operate on a single reasoning trajectory to make the same forecasts one would typically seek from an explanation. The forecaster's training data is obtained by querying the LRM, with no human annotation, and its inference takes a single forward pass. We instantiate this approach on two tasks: predicting how likely the LRM is to repeat its answer on re-runs, and predicting how removing parts of the input changes its answer. We evaluate both tasks across three diverse reasoning datasets and find that trained Behavior Forecasters are more accurate than GPT-5.4 and Claude Opus-4.6 reading the same trajectories as naive readers, at a small fraction of their inference cost. We also find that fine-tuning the backbone end-to-end and initializing it from the target LRM are each necessary for strong performance. These results show that the reasoning trajectory carries information about the LRM's future behavior that goes beyond what naive reading conveys.
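To make the annotation-free labeling concrete, the following is a minimal sketch of how training targets for the two tasks might be derived purely by querying the LRM. Everything named here is an assumption for illustration and is not taken from the paper: the `query_lrm` callable (returning a trajectory and a final answer), the string-based `normalize` matching rule, the `n_reruns` default, and the dictionary layout of a training example.

```python
from collections.abc import Callable, Sequence

# Hypothetical interface: a call to the LRM returns (trajectory, final_answer).
QueryFn = Callable[[str], tuple[str, str]]


def normalize(answer: str) -> str:
    """Crude answer canonicalization (an assumption; real matching may differ)."""
    return answer.strip().lower()


def repeat_rate(original: str, reruns: Sequence[str]) -> float:
    """Fraction of re-run answers that match the original answer."""
    if not reruns:
        raise ValueError("need at least one re-run answer")
    target = normalize(original)
    return sum(normalize(a) == target for a in reruns) / len(reruns)


def ablation_changes_answer(original: str, ablated: str) -> bool:
    """True if the answer on the ablated input differs from the original answer."""
    return normalize(original) != normalize(ablated)


def build_example(
    prompt: str,
    ablated_prompt: str,
    query_lrm: QueryFn,
    n_reruns: int = 8,
) -> dict:
    """Build one training example using only LRM queries (no human labels).

    The forecaster would see only `trajectory`; the two targets are
    computed from additional queries made at data-collection time.
    """
    trajectory, answer = query_lrm(prompt)
    rerun_answers = [query_lrm(prompt)[1] for _ in range(n_reruns)]
    _, ablated_answer = query_lrm(ablated_prompt)
    return {
        "trajectory": trajectory,
        "repeat_rate": repeat_rate(answer, rerun_answers),
        "answer_changed": ablation_changes_answer(answer, ablated_answer),
    }
```

Under these assumptions, the extra queries are needed only when collecting data. At inference time a trained forecaster would read the single trajectory and predict both targets directly, which is where the cost advantage described in the abstract would come from.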