First Return, Entropy-Eliciting Explore

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of Large Language Models (LLMs) but struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts from them to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on the mathematical reasoning benchmark AIME24 show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
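To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that "high-uncertainty decision points" can be approximated by the Shannon entropy of the next-token distribution at each step, and that "targeted rollouts" means sampling several continuations from the prefix ending at such a point and averaging a verifiable reward. The names `token_entropy`, `high_entropy_positions`, `estimate_prefix_values`, and the `rollout` callback are all hypothetical and chosen for illustration only.

```python
import math
from typing import Callable, Dict, List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(
    step_probs: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Indices of the k highest-entropy steps, returned in trajectory order."""
    entropies = [token_entropy(p) for p in step_probs]
    top = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)[:k]
    return sorted(top)


def estimate_prefix_values(
    tokens: Sequence[str],
    positions: Sequence[int],
    rollout: Callable[[List[str]], float],
    n_rollouts: int,
) -> Dict[int, float]:
    """Average reward of n_rollouts continuations from each prefix tokens[:pos].

    `rollout` is an assumed callback: it continues the given prefix with the
    policy and returns a verifiable reward (e.g. 1.0 if the final answer is
    correct, else 0.0).
    """
    values = {}
    for pos in positions:
        prefix = list(tokens[:pos])
        rewards = [rollout(prefix) for _ in range(n_rollouts)]
        values[pos] = sum(rewards) / n_rollouts
    return values


# Toy trajectory: four steps with hand-written next-token distributions.
step_probs = [
    [1.0, 0.0],                # fully certain step, entropy 0
    [0.5, 0.5],                # uncertain step, entropy ln 2
    [0.9, 0.1],                # mostly certain step
    [0.25, 0.25, 0.25, 0.25],  # most uncertain step, entropy ln 4
]
decision_points = high_entropy_positions(step_probs, k=2)  # → [1, 3]
```

Under these assumptions, the per-prefix averages returned by `estimate_prefix_values` would play the role of intermediate feedback: a prefix whose rollouts succeed more often than an earlier one indicates that the steps in between helped, without requiring step-level labels.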