Stopping When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Abstract

Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but they often overthink: they keep reasoning after a solution has already stabilized, wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence. They may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant steps as candidate exit points, and verification confirms whether stopping there is safe. This allows PUMA to remove the redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves an average token reduction of 26.2% while preserving accuracy and the quality of the retained CoT. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.
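
The detect-then-verify idea described above can be illustrated with a minimal sketch. This is not the PUMA implementation: the abstract does not specify the detector or the verifier, so the sketch assumes a toy token-overlap heuristic in place of the Redundancy Detector and a trial-answer agreement check in place of answer-level verification. All names (`is_redundant`, `early_exit_index`, `last_x_value`, `threshold`, `patience`) are hypothetical and chosen only for illustration.

```python
import re
from typing import Callable, List, Optional


def _tokens(step: str) -> set:
    """Lowercased alphanumeric tokens of one reasoning step."""
    return set(re.findall(r"[a-z0-9]+", step.lower()))


def is_redundant(step: str, history: List[str], threshold: float = 0.8) -> bool:
    """Toy stand-in for a redundancy detector (assumption, not PUMA's detector).

    A step counts as redundant when at least `threshold` of its tokens
    already appear in some single earlier step.
    """
    new = _tokens(step)
    if not new:
        return False
    return any(len(new & _tokens(old)) / len(new) >= threshold for old in history)


def early_exit_index(
    steps: List[str],
    trial_answer: Callable[[List[str]], Optional[str]],
    patience: int = 2,
) -> Optional[int]:
    """Return how many leading steps to keep, or None if no safe exit is found.

    A candidate exit is raised after `patience` consecutive redundant steps.
    It is accepted only if the trial answer from the retained prefix matches
    the trial answer from everything generated so far (hypothetical verifier).
    """
    redundant_run = 0
    for i, step in enumerate(steps):
        redundant_run = redundant_run + 1 if is_redundant(step, steps[:i]) else 0
        if redundant_run >= patience:
            keep = i + 1 - redundant_run  # drop the redundant run itself
            kept_answer = trial_answer(steps[:keep])
            if kept_answer is not None and kept_answer == trial_answer(steps[: i + 1]):
                return keep
    return None


def last_x_value(prefix: List[str]) -> Optional[str]:
    """Hypothetical trial-answer extractor: the last 'x = <number>' in the prefix."""
    found = re.findall(r"\bx = (\d+)", " ".join(prefix))
    return found[-1] if found else None


demo_steps = [
    "Let x be the unknown so 2x + 3 = 11",
    "Subtract 3 to get 2x = 8",
    "Divide by 2 so x = 4",
    "Divide by 2 again so x = 4",
    "Check: divide by 2 so x = 4",
]
kept = early_exit_index(demo_steps, last_x_value)
print(demo_steps[:kept] if kept is not None else demo_steps)
```

In this toy trace the last two steps only restate the third, so the sketch keeps the first three steps. A real system would replace the overlap heuristic with a semantic detector and obtain trial answers by querying the model itself.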