RLP: Reinforcement as a Pretraining Objective

Abstract



The dominant paradigm for training large reasoning models starts with pretraining using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful for scaling reasoning, is introduced only in the very last phase of post-training, after supervised fine-tuning. Is this the optimal way to train? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning, exploration, to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed from the information gain it provides for predicting future tokens. This objective encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. Concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared with conditioning on the context alone. This yields a verifier-free, dense reward signal that allows efficient training on the full document stream during pretraining. RLP thus reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% (a gain of 18.51 percentage points) and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
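
The reward described in the abstract, the gain in next-token log-likelihood from conditioning on a sampled reasoning chain, can be illustrated with a small sketch. This is only a toy reading of that one sentence, not the paper's implementation: the function name `information_gain_reward` and the probabilities below are hypothetical, and a real setup would obtain both probabilities from language-model forward passes rather than from hand-picked numbers.

```python
import math


def information_gain_reward(p_with_thought: float, p_context_only: float) -> float:
    """Log-likelihood gain for the observed next token.

    p_with_thought: probability of the true next token given context + sampled thought.
    p_context_only: probability of the same token given the context alone.
    Both names are illustrative assumptions, not identifiers from the paper.
    """
    if not (0.0 < p_with_thought <= 1.0 and 0.0 < p_context_only <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return math.log(p_with_thought) - math.log(p_context_only)


# Hypothetical numbers: a thought that raises the true token's probability
# from 0.20 to 0.60 earns a positive reward; one that lowers it to 0.10
# earns a negative reward.
helpful = information_gain_reward(0.60, 0.20)
unhelpful = information_gain_reward(0.10, 0.20)
print(f"helpful: {helpful:.3f}, unhelpful: {unhelpful:.3f}")
# prints "helpful: 1.099, unhelpful: -0.693"
```

Because the reward is a difference of log-probabilities of text that is already in the corpus, it needs no external verifier and can be computed at every token position, which is what the abstract means by a verifier-free, dense signal.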
