MUR: Momentum Uncertainty guided Reasoning for Large Language Models

Abstract

Large Language Models (LLMs) have achieved impressive performance on reasoning-intensive tasks, yet optimizing their reasoning efficiency remains an open challenge. Test-Time Scaling (TTS) improves reasoning quality, but it often leads to overthinking, in which tokens are wasted on redundant computation. This work investigates how to guide LLM test-time scaling efficiently and adaptively without additional training. Inspired by the concept of momentum in physics, we propose Momentum Uncertainty-guided Reasoning (MUR), which dynamically allocates thinking budget to critical reasoning steps by tracking and aggregating stepwise uncertainty over time. To support flexible inference-time control, we introduce gamma-control, a simple mechanism that tunes the reasoning budget through a single hyperparameter. We provide theoretical proofs supporting the superiority of MUR in terms of stability and bias. MUR is evaluated against various TTS methods on four challenging benchmarks (MATH-500, AIME24, AIME25, and GPQA-diamond) using recent Qwen3 models of three sizes (1.7B, 4B, and 8B). Results show that MUR reduces computation by over 50% on average while improving accuracy by 0.62-3.37%.
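The abstract does not state the update rule or the trigger condition, so the following is only a minimal sketch of the general idea of tracking stepwise uncertainty with a momentum term. It assumes that the aggregate is an exponential moving average, that a step is flagged for extra compute when its uncertainty exceeds the running aggregate divided by gamma, and that uncertainties are non-negative (for example, a mean negative log-probability). The names `MomentumTracker`, `flag_steps`, `alpha`, and the exact comparison are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MomentumTracker:
    """Running aggregate of per-step uncertainty (illustrative assumption)."""
    alpha: float = 0.9    # assumed smoothing factor for the moving average
    gamma: float = 0.9    # assumed budget knob in (0, 1]; smaller flags fewer steps
    momentum: float = 0.0
    steps_seen: int = 0

    def should_scale(self, step_uncertainty: float) -> bool:
        # The first step has no history to compare against, so it is never flagged.
        if self.steps_seen == 0:
            return False
        # Assumed rule: flag a step whose uncertainty clearly exceeds the aggregate.
        return step_uncertainty > self.momentum / self.gamma

    def update(self, step_uncertainty: float) -> None:
        # Exponential moving average; the first step initializes the aggregate.
        if self.steps_seen == 0:
            self.momentum = step_uncertainty
        else:
            self.momentum = (self.alpha * self.momentum
                             + (1.0 - self.alpha) * step_uncertainty)
        self.steps_seen += 1


def flag_steps(uncertainties: List[float],
               alpha: float = 0.9,
               gamma: float = 0.9) -> List[bool]:
    """Return, for each step, whether it would receive extra thinking budget."""
    tracker = MomentumTracker(alpha=alpha, gamma=gamma)
    flags = []
    for u in uncertainties:
        flags.append(tracker.should_scale(u))  # decide using history only
        tracker.update(u)                      # then fold the step into the aggregate
    return flags
```

Under these assumptions, lowering `gamma` raises the threshold and flags fewer steps, which is one way a single hyperparameter could trade accuracy for compute.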
