Are Large Reasoning Models Interruptible?

Abstract

Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While this assumption generally holds for short-term tasks, it breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically between the time the model starts thinking and its final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model's partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold their reasoning into the final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.
