The Markovian Thinker

Abstract

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence, this yields compute that is linear in thinking length, with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the chunk boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk that is sufficient for seamless continuation of reasoning after the reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate that, at an average thinking length of 96K tokens, LongCoT-RL costs 27 H100-months versus 7 for Delethink. Analysis at RL initialization shows that off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
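
To make the chunk-and-reset mechanism concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that the "short carryover" is simply the tail of the previous chunk and that attention cost can be approximated by counting the context positions each new token attends to. The names `think`, `toy_policy`, `chunk_size`, `carryover`, and the `EOS` token id are hypothetical and introduced only for illustration.

```python
from typing import Callable, List, Optional, Tuple

EOS = -1  # hypothetical end-of-sequence token id (assumption, not from the paper)


def think(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_tokens: int,
    chunk_size: Optional[int] = None,
    carryover: int = 0,
) -> Tuple[List[int], int, int]:
    """Generate up to max_tokens reasoning tokens.

    chunk_size=None keeps the full history in context (LongCoT-style).
    Otherwise the context is reset to prompt + last `carryover` tokens
    whenever it reaches chunk_size (a Delethink-style sketch).
    Returns (trace, attended_positions, peak_context_length).
    """
    if carryover < 0:
        raise ValueError("carryover must be non-negative")
    if chunk_size is not None and len(prompt) + carryover >= chunk_size:
        raise ValueError("prompt plus carryover must fit inside one chunk")

    context = list(prompt)
    trace: List[int] = []
    attended = 0  # proxy for attention compute
    peak = 0      # proxy for KV-cache memory

    while len(trace) < max_tokens:
        if chunk_size is not None and len(context) >= chunk_size:
            # Chunk boundary: drop the old context, keep prompt + carryover.
            tail = trace[len(trace) - carryover:]
            context = list(prompt) + tail
        attended += len(context)  # the new token attends to the whole context
        peak = max(peak, len(context))
        token = next_token(context)
        trace.append(token)
        context.append(token)
        if token == EOS:
            break
    return trace, attended, peak


def toy_policy(context: List[int]) -> int:
    """Stand-in for a model: emits a token that depends only on the context."""
    return len(context) % 7


if __name__ == "__main__":
    prompt = [0] * 10
    _, full_cost, full_peak = think(toy_policy, prompt, max_tokens=1000)
    _, chunk_cost, chunk_peak = think(
        toy_policy, prompt, max_tokens=1000, chunk_size=100, carryover=20
    )
    print("full context:", full_cost, "positions attended, peak", full_peak)
    print("chunked     :", chunk_cost, "positions attended, peak", chunk_peak)
```

In this toy setting the full-context count grows quadratically with the number of generated tokens, while the chunked count grows linearly and the peak context length stays below `chunk_size`. The sketch says nothing about whether a real model can continue reasoning from the carryover; per the abstract, that ability is what RL training in the Delethink environment is meant to teach.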
