Thinker: Learning to Think Fast and Slow

Abstract

Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 24.9% to 27.9% for Qwen2.5-1.5B, and from 45.9% to 49.8% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 26.8% accuracy using fewer than 1000 tokens, demonstrating substantial gains in inference efficiency. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems that benefit from targeted training.
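The four-stage task can be illustrated with a minimal sketch. This is not the authors' implementation: the `generate` callable, the prompt wording, the `ThinkerTrace` container, and the `other_budget` value are all hypothetical and chosen only for illustration. The only detail taken from the abstract is the order of the stages and the idea of a strict token budget for Fast Thinking (set here to 1000 tokens as an assumed value).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical model interface: (prompt, max_tokens) -> generated text.
Generate = Callable[[str, int], str]


@dataclass
class ThinkerTrace:
    """Outputs of the four stages, in order."""
    fast: str
    verification: str
    slow: str
    summary: str


def run_thinker(question: str, generate: Generate,
                fast_budget: int = 1000,      # assumed value for the strict budget
                other_budget: int = 4000      # assumed; not given in the abstract
                ) -> ThinkerTrace:
    # Stage 1: Fast Thinking, answer under a strict token budget.
    fast = generate(f"Question: {question}\nAnswer concisely.", fast_budget)

    # Stage 2: Verification, the model evaluates its own initial response.
    verification = generate(
        f"Question: {question}\nProposed answer: {fast}\n"
        "Evaluate whether this answer is correct.",
        other_budget,
    )

    # Stage 3: Slow Thinking, refine the initial response with more deliberation.
    slow = generate(
        f"Question: {question}\nInitial answer: {fast}\n"
        f"Verification: {verification}\nRefine the answer step by step.",
        other_budget,
    )

    # Stage 4: Summarization, distill the refinement into precise steps.
    summary = generate(
        f"Question: {question}\nDetailed reasoning: {slow}\n"
        "Summarize the reasoning into precise steps.",
        other_budget,
    )
    return ThinkerTrace(fast, verification, slow, summary)
```

In this sketch every stage always runs and each later prompt includes the earlier outputs. Whether the actual task skips Slow Thinking when Verification accepts the fast answer is not stated in the abstract, so no such branching is assumed here.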
