Long Live Balance: Information Bottleneck-Driven Tree-Based Policy Optimization

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck (IB) theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves online sampling efficiency, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments on standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
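To make the two quantities named in the abstract concrete, the following is a minimal sketch of how step-level diversity and mutual information with answer correctness could be estimated from rollouts sampled below a single tree node. It is not the definition of IB-Score, which the abstract does not give. The function names (`entropy`, `branch_statistics`), the `(branch_id, is_correct)` rollout format, and the use of plug-in (count-based) estimators are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in nats) of the empirical distribution given by counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def branch_statistics(rollouts):
    """Plug-in estimates for one tree node (illustrative, not the paper's estimator).

    rollouts: list of (branch_id, is_correct) pairs, one per sampled trajectory,
              where branch_id names the reasoning step taken at this node.
    Returns (diversity, mutual_information):
      diversity          = H(Z), entropy of the branch choice Z
      mutual_information = I(Z; Y) = H(Z) + H(Y) - H(Z, Y), with Y the correctness
    """
    h_z = entropy(Counter(b for b, _ in rollouts).values())
    h_y = entropy(Counter(y for _, y in rollouts).values())
    h_zy = entropy(Counter(rollouts).values())
    return h_z, h_z + h_y - h_zy


# Hypothetical rollouts: branch "a" always leads to a correct answer, "b" never does.
informative = [("a", True), ("a", True), ("b", False), ("b", False)]
# Hypothetical rollouts: the branch choice says nothing about correctness.
uninformative = [("a", True), ("a", False), ("b", True), ("b", False)]

print(branch_statistics(informative))
print(branch_statistics(uninformative))
```

In the first toy set the branch choice fully determines correctness, so the estimated mutual information equals the branch entropy (ln 2). In the second set both branches are equally diverse but carry no information about the answer, so the mutual information estimate is zero. How the two terms are weighted against each other in IB-Score is not stated in the abstract and is deliberately left out here.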