Long Live the Balance: Information Bottleneck-driven Tree-based Policy Optimization

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. An analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
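
To make the token-budget claim concrete, the following minimal sketch compares independent sampling with tree sampling in which sibling trajectories share a common prefix. It is only an illustration under stated assumptions: the function names, the fixed segment length, and the branching factors are hypothetical, and the numbers were chosen so that the toy reproduces a 50% gain. It does not describe the tree shapes or the IB-guided branching rule that IB-TPO actually uses.

```python
from typing import Sequence, Tuple


def independent_token_cost(num_trajectories: int, trajectory_len: int) -> int:
    """Tokens generated when every trajectory is sampled from scratch."""
    return num_trajectories * trajectory_len


def tree_token_cost(branching: Sequence[int], segment_len: int) -> Tuple[int, int]:
    """Return (num_leaves, total_tokens) for a sampling tree.

    branching[d] is the number of children spawned by each node at depth d.
    Every edge generates `segment_len` new tokens; tokens on a shared prefix
    are generated once and reused by all descendants.
    """
    nodes_at_depth = 1
    total_tokens = 0
    for children_per_node in branching:
        nodes_at_depth *= children_per_node
        total_tokens += nodes_at_depth * segment_len
    return nodes_at_depth, total_tokens


# Hypothetical setting: each full trajectory is 4 segments of 100 tokens.
flat_tokens = independent_token_cost(num_trajectories=8, trajectory_len=400)
leaves, tree_tokens = tree_token_cost(branching=[4, 1, 3, 1], segment_len=100)

print(flat_tokens, leaves, tree_tokens)
```

In this toy setting both strategies generate 3200 tokens, but the tree ends in 12 complete trajectories instead of 8, because the first two segments of each branch are generated once and shared by three leaves.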