Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition

Abstract

Large Reasoning Models (LRMs) are criticized for the excessively long Chain-of-Thought (CoT) they need to derive a final answer, which results in high first-token and overall latency. Typically, the CoT of an LRM mixes multiple thinking units, each of which attempts to produce a candidate answer to the original query. Hence, a natural way to improve efficiency is to reduce the number of units. Yet the thinking units in vanilla CoT cannot be explicitly managed, which makes this difficult. This paper introduces Multi-Turn Decomposition (MinD), which decodes conventional CoT into a sequence of explicit, structured, turn-wise interactions to bridge the gap. In MinD, the model provides a multi-turn response to the query, where each turn contains one thinking unit and yields a corresponding answer. Subsequent turns can reflect on, verify, revise, or explore alternatives to both the thinking and the answer of earlier turns. This not only delivers an answer sooner, but also enables explicit control over the iterative reasoning process (i.e., users may halt or continue at any turn). We follow a supervised fine-tuning (SFT) followed by reinforcement learning (RL) paradigm to realize MinD. We first rephrase the outputs of an LRM into multi-turn formats by prompting another LLM, and then tune the LRM on this data. Observing that the tuned model tends to consume even more tokens than the original one (probably because the multi-turn format introduces additional answer tokens), we advocate leveraging RL algorithms such as GRPO to prioritize correct outputs with fewer turns. Trained on the MATH dataset using R1-Distill models, MinD can achieve up to ~70% reduction in both output token usage and time to first token (TTFT), while maintaining competitive performance on reasoning benchmarks such as MATH-500, AIME24, AMC23, and GPQA-Diamond.
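
The abstract does not spell out the reward used in the RL stage, so the following is only a minimal sketch of how "correct outputs with fewer turns" could be favored under GRPO-style training. The `Rollout` container, the `reward` function, and the `max_turns` and `turn_penalty` values are all hypothetical and are not taken from the paper; only the group-relative normalization (centering and scaling rewards within a group of samples for the same query) reflects how GRPO is commonly described, and the use of the population standard deviation is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class Rollout:
    """Hypothetical container: the answer emitted at the end of each turn."""
    turn_answers: List[str]


def reward(rollout: Rollout, gold: str,
           max_turns: int = 4, turn_penalty: float = 0.1) -> float:
    """Hypothetical reward: 1.0 for a correct final answer, minus a small
    cost per extra turn; 0.0 if wrong, empty, or over the turn budget."""
    n_turns = len(rollout.turn_answers)
    if n_turns == 0 or n_turns > max_turns:
        return 0.0
    if rollout.turn_answers[-1].strip() != gold.strip():
        return 0.0
    return 1.0 - turn_penalty * (n_turns - 1)


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style normalization: center and scale rewards within one group
    of rollouts sampled for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Three sampled responses to one query whose reference answer is "42".
    group = [
        Rollout(["42"]),              # correct in one turn
        Rollout(["41", "40", "42"]),  # correct after two revision turns
        Rollout(["7"]),               # wrong
    ]
    rewards = [reward(r, "42") for r in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Under these assumptions, a response that is correct in one turn receives a higher advantage than one that reaches the same answer after several turns, and both rank above an incorrect response, which is the ordering the abstract asks the RL stage to encourage.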
