Are Large Reasoning Models Interruptible?
October 13, 2025
Authors: Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
cs.AI
Abstract
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the "frozen world" assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model's final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades as models incorporate updated information.
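
To make the two scenarios concrete, below is a minimal sketch of how an interruption or a mid-reasoning context update could be simulated around a generic text-generation call. This is an illustration of the setup described in the abstract, not the authors' evaluation harness; the `generate` callable, the prompt wording, and the budget parameters are all hypothetical placeholders.

```python
from typing import Callable

# Hypothetical interface: generate(prompt, max_new_tokens) -> generated text.
GenerateFn = Callable[[str, int], str]

FORCED_ANSWER_SUFFIX = (
    "\n\nTime is up. Stop reasoning and state your final answer now."
)


def run_with_interruption(generate: GenerateFn, problem: str,
                          reasoning_budget: int, answer_budget: int = 64) -> str:
    """Interruption scenario (illustrative sketch, not the paper's protocol):
    cap the reasoning tokens, then force an answer from the partial trace."""
    # Phase 1: let the model reason only up to the interruption budget.
    partial_reasoning = generate(problem, reasoning_budget)
    # Phase 2: interrupt and demand an answer conditioned on the partial trace.
    forced_prompt = problem + "\n" + partial_reasoning + FORCED_ANSWER_SUFFIX
    return generate(forced_prompt, answer_budget)


def run_with_context_update(generate: GenerateFn, problem: str, update: str,
                            prefix_budget: int, remaining_budget: int) -> str:
    """Dynamic-context scenario (illustrative sketch): inject an update to the
    task while the model is mid-reasoning, then let it finish."""
    partial_reasoning = generate(problem, prefix_budget)
    updated_prompt = (
        problem + "\n" + partial_reasoning
        + "\n\n[Update] " + update
        + "\nContinue from here and give the final answer."
    )
    return generate(updated_prompt, remaining_budget)
```

In this sketch, a static evaluation corresponds to a single `generate` call with an effectively unlimited budget; the robustness drop the abstract reports is the gap between that score and the interrupted or updated runs.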