Are Large Reasoning Models Interruptible?
October 13, 2025
Authors: Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
cs.AI
Abstract
Large Reasoning Models (LRMs) excel at complex reasoning but are
traditionally evaluated in static, "frozen world" settings: model responses are
assumed to be instantaneous, and the context of a request is presumed to be
immutable over the duration of the response. While generally true for
short-term tasks, the "frozen world" assumption breaks down in modern reasoning
tasks such as assistive programming, where models may take hours to think
through problems and code may change dramatically from the time the model
starts thinking to the model's final output. In this work, we challenge the
frozen world assumption and evaluate LRM robustness under two realistic dynamic
scenarios: interruptions, which test the quality of the model's partial outputs
on a limited budget, and dynamic context, which tests model adaptation to
in-flight changes. Across mathematics and programming benchmarks that require
long-form reasoning, static evaluations consistently overestimate robustness:
even state-of-the-art LRMs, which achieve high accuracy in static settings, can
fail unpredictably when interrupted or exposed to changing context, with
performance dropping by up to 60% when updates are introduced late in the
reasoning process. Our analysis further reveals several novel failure modes,
including reasoning leakage, where models fold the reasoning into their final
answer when interrupted; panic, where under time pressure models abandon
reasoning entirely and return incorrect answers; and self-doubt, where
performance degrades while incorporating updated information.
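
To make the two dynamic evaluation settings concrete, below is a minimal sketch of how an interruption test and a dynamic-context test could be simulated against a text-generation API. The `generate(prompt, max_tokens)` callable, the prompt wording, and the token budgets are illustrative assumptions for this sketch, not the authors' actual protocol.

```python
from typing import Callable

# Assumed interface: generate(prompt, max_tokens) returns the raw text the
# model produced (its reasoning trace and/or answer). This callable and all
# prompt strings below are hypothetical, not taken from the paper.
Generate = Callable[..., str]

INTERRUPT_PROMPT = "\nTime is up. Stop reasoning and give your final answer now.\n"


def interruption_test(generate: Generate, question: str, budget: int) -> str:
    """Cut reasoning off after `budget` tokens, then force an immediate answer."""
    partial = generate(question, max_tokens=budget)  # partial reasoning trace
    # Re-prompt with the truncated trace plus an interruption message,
    # testing the quality of the model's partial output under a fixed budget.
    return generate(question + partial + INTERRUPT_PROMPT, max_tokens=256)


def dynamic_context_test(generate: Generate, question: str,
                         update: str, inject_at: int) -> str:
    """Inject an in-flight context update `inject_at` tokens into reasoning."""
    partial = generate(question, max_tokens=inject_at)
    # Splice the update into the ongoing trace, simulating a context (e.g.,
    # a codebase) that changed while the model was still thinking.
    updated = question + partial + f"\n[Update] The problem has changed: {update}\n"
    return generate(updated, max_tokens=2048)
```

Varying `budget` and `inject_at` is what would let such a harness probe the abstract's central finding: the later the interruption or update lands in the reasoning process, the larger the performance drop.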