Does Your Reasoning Model Implicitly Know When to Stop Thinking?

Abstract

Recent advances in large reasoning models (LRMs) have greatly improved their performance on complex reasoning tasks through long chains of thought (CoTs). However, this approach often introduces substantial redundancy, which impairs computational efficiency and causes significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even harm accuracy. In a deeper analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, but that this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this potential for efficient reasoning. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates the efficient reasoning patterns discovered by SAGE into standard pass@1 inference, markedly improving both the reasoning accuracy and the efficiency of LRMs across multiple challenging mathematical benchmarks.
