Long Live the Balance: Information Bottleneck-Driven Tree-Based Policy Optimization

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
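
To make the token-budget argument concrete, the following is a minimal sketch of why sampling trajectories as a tree can yield more trajectories than independent sampling for the same number of generated tokens: shared prefixes are generated once and reused by every branch. It is a toy accounting model, not the IB-TPO sampling algorithm. The function names, the fixed segment length, and the fan-outs are hypothetical and chosen only for illustration; they are not the paper's configuration.

```python
def independent_cost(num_trajectories: int, length: int) -> int:
    """Tokens generated when every trajectory is sampled from scratch."""
    return num_trajectories * length


def tree_cost(branching: list[int], segment_length: int) -> tuple[int, int]:
    """Return (leaf trajectories, tokens generated) for a sampling tree.

    Assumption (hypothetical model): branching[d] is the number of children
    each node at depth d spawns, and every edge is one reasoning segment of
    exactly `segment_length` tokens. Shared prefixes are generated only once.
    """
    nodes_at_depth = 1  # start from the root (the prompt)
    tokens = 0
    for fanout in branching:
        nodes_at_depth *= fanout
        # Each new node costs one freshly generated segment.
        tokens += nodes_at_depth * segment_length
    return nodes_at_depth, tokens


if __name__ == "__main__":
    # Toy numbers: two segments of 100 tokens, fan-out 2 then 3.
    leaves, tokens = tree_cost([2, 3], segment_length=100)
    per_trajectory = 2 * 100  # full length of one root-to-leaf path
    affordable = tokens // per_trajectory  # independent samples for same budget
    print(f"tree: {leaves} trajectories for {tokens} tokens")
    print(f"independent: {affordable} trajectories for {tokens} tokens")
```

In this toy setting the tree produces 6 trajectories for 800 generated tokens, while independent sampling affords only 4 trajectories of the same length. The actual gain depends on where the tree branches and how long the shared prefixes are.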