Let LLMs Break Free from Overthinking via Self-Braking Tuning

Abstract

Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, and they perform strongly across a variety of tasks. However, this gain comes with a substantial increase in redundant reasoning during generation, which raises computational overhead and exacerbates the problem of overthinking. Numerous existing approaches try to address overthinking, but they often rely on external interventions. In this paper, we propose Self-Braking Tuning (SBT), a novel framework that lets the model regulate its own reasoning process and thereby removes the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within a reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to learn naturally when to terminate its reasoning. Experiments on mathematical benchmarks (AIME, AMC, MATH500, GSM8K) show that our method reduces token consumption by up to 60% while maintaining accuracy comparable to unconstrained models.
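As a rough illustration of what an answer-based overthinking signal could look like, the sketch below measures how much of a reasoning trace comes after the first step that already contains the reference answer, then truncates the trace there and appends a braking sentence. This is not the paper's actual metric or data pipeline, which the abstract does not specify. The function names, the whitespace token count, the substring match, and the wording of the braking sentence are all assumptions made for illustration.

```python
from typing import List, Optional

# Hypothetical braking sentence; the wording used in the paper is not given in the abstract.
DEFAULT_BRAKE = "I already have the answer, so I will stop reasoning here."


def first_correct_step(steps: List[str], reference_answer: str) -> Optional[int]:
    """Return the index of the first step containing the reference answer, or None.

    Assumption: a plain substring match stands in for a real answer checker.
    It would wrongly match "42" inside "142", so treat it as a toy criterion.
    """
    for i, step in enumerate(steps):
        if reference_answer in step:
            return i
    return None


def redundancy_ratio(steps: List[str], reference_answer: str) -> float:
    """Fraction of whitespace-separated tokens spent after the first correct step."""
    idx = first_correct_step(steps, reference_answer)
    total = sum(len(s.split()) for s in steps)
    if idx is None or total == 0:
        return 0.0  # no correct step found, so nothing is counted as redundant
    after = sum(len(s.split()) for s in steps[idx + 1:])
    return after / total


def truncate_with_brake(steps: List[str], reference_answer: str,
                        brake: str = DEFAULT_BRAKE) -> List[str]:
    """Keep steps up to the first correct one and append the braking sentence."""
    idx = first_correct_step(steps, reference_answer)
    if idx is None:
        return list(steps)  # leave traces that never reach the answer untouched
    return steps[: idx + 1] + [brake]


if __name__ == "__main__":
    trace = [
        "Compute 6 * 7.",
        "So the answer is 42.",
        "Wait, let me double check.",
        "6 * 7 = 42 again.",
    ]
    print(redundancy_ratio(trace, "42"))
    print(truncate_with_brake(trace, "42"))
```

In this toy trace, the last two steps only re-verify a result that was already reached, which is the kind of redundancy the abstract describes. A real pipeline would need a proper tokenizer and a more robust answer matcher than substring search.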
