Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition

Abstract

Large Reasoning Models (LRMs) are criticized for the excessively long Chain-of-Thought (CoT) they generate before reaching a final answer, which leads to high first-token and overall latency. Typically, the CoT of an LRM mixes multiple thinking units, each of which attempts to produce a candidate answer to the original query. A natural way to improve efficiency is therefore to reduce the number of units. However, the thinking units in vanilla CoT cannot be explicitly managed, which makes this difficult. To bridge the gap, this paper introduces Multi-Turn Decomposition (MinD), which decodes conventional CoT into a sequence of explicit, structured, turn-wise interactions. In MinD, the model provides a multi-turn response to the query, where each turn contains one thinking unit and yields a corresponding answer. Subsequent turns can reflect on, verify, or revise both the thinking and answer parts of earlier turns, or explore alternative approaches. This not only delivers the answer more quickly, but also enables explicit control over the iterative reasoning process (i.e., users may halt or continue at any turn). We follow a supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm to realize MinD. We first rephrase the outputs of an LRM into multi-turn formats by prompting another LLM, and then tune the LRM on this data. Observing that the tuned model tends to consume even more tokens than the original one (probably because the multi-turn format introduces additional answer tokens), we advocate leveraging RL algorithms such as GRPO to prioritize correct outputs with fewer turns. Trained on the MATH dataset using R1-Distill models, MinD can achieve up to ~70% reduction in both output token usage and time to first token (TTFT), while maintaining competitive performance on reasoning benchmarks such as MATH-500, AIME24, AMC23, and GPQA-Diamond.
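The two control ideas in the abstract, halting at any turn and preferring correct outputs with fewer turns, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Turn` structure, the `run_with_turn_budget` helper, and the shape and constants of `turn_aware_reward` are hypothetical and are not taken from the paper, which does not specify its turn format or reward in this abstract.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterable, Optional


@dataclass
class Turn:
    """Hypothetical container for one turn: a thinking unit plus its answer."""
    thinking: str
    answer: str


def run_with_turn_budget(turns: Iterable[Turn], max_turns: int) -> Optional[str]:
    """Consume at most `max_turns` turns and return the latest candidate answer.

    `islice` stops pulling from `turns` once the budget is reached, so a lazy
    generator of turns is not asked to produce turns nobody will read.
    """
    latest = None
    for turn in islice(turns, max_turns):
        latest = turn.answer
    return latest


def turn_aware_reward(is_correct: bool, num_turns: int,
                      max_turns: int = 4, penalty: float = 0.1) -> float:
    """Illustrative reward (an assumption, not the paper's formula).

    Wrong outputs score 0. Correct outputs score 1 minus a small penalty for
    every turn beyond the first, with the turn count capped at `max_turns`.
    """
    if not is_correct:
        return 0.0
    extra_turns = max(0, min(num_turns, max_turns) - 1)
    return 1.0 - penalty * extra_turns


if __name__ == "__main__":
    demo = [
        Turn("Factor the quadratic.", "x = 2"),
        Turn("Verify by substitution.", "x = 2"),
    ]
    print(run_with_turn_budget(demo, max_turns=1))
    print(turn_aware_reward(is_correct=True, num_turns=2))
```

Because every turn ends with its own answer, stopping after the first turn still returns a usable candidate, which is the intuition behind the reduced time to first token. In a GRPO-style setup, a function like `turn_aware_reward` would be one way to score sampled responses so that shorter correct ones are preferred.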
