What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT

Abstract

Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what *characterizes* an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended *wait* tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the "longer-is-better" narrative, we find that both naive CoT lengthening and increased review are associated with *lower* accuracy. Because a CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic, the *Failed-Step Fraction (FSF)*, defined as the fraction of steps in abandoned branches, that consistently predicts correctness better than length and review ratio across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that *fail less* and support *structure-aware* test-time scaling over indiscriminately generating long CoTs.
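
To make the FSF definition and the ranking intervention concrete, here is a minimal sketch. It assumes each CoT has already been segmented into steps and that each step carries a label saying whether it lies in an abandoned branch. The abstract does not say how the graph or these labels are extracted, so the `Step` type, the `abandoned` flag, and both function names are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    # Assumed label: True if the step lies in a branch that was later abandoned.
    abandoned: bool


def failed_step_fraction(steps: List[Step]) -> float:
    """Fraction of steps that sit in abandoned branches (0.0 for an empty trace)."""
    if not steps:
        return 0.0
    return sum(s.abandoned for s in steps) / len(steps)


def select_by_fsf(candidates: List[List[Step]]) -> int:
    """Return the index of the candidate CoT with the lowest FSF (first one on ties)."""
    return min(range(len(candidates)),
               key=lambda i: failed_step_fraction(candidates[i]))


# Two toy candidate traces for the same problem.
cot_a = [Step("try substitution", True), Step("dead end", True),
         Step("factor instead", False), Step("answer: 4", False)]
cot_b = [Step("factor", False), Step("answer: 4", False)]

fsf_a = failed_step_fraction(cot_a)  # 2 of 4 steps are in an abandoned branch
fsf_b = failed_step_fraction(cot_b)  # no abandoned steps
best = select_by_fsf([cot_a, cot_b])
```

In this toy case the second trace is selected because none of its steps were abandoned. The sketch ranks on FSF alone; how ties are broken and how FSF might be combined with other signals are left open here.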
