Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization
May 27, 2026
Authors: Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang
cs.AI
Abstract
Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves the efficiency of online sampling, generating 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
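The claim that tree sampling yields more trajectories under the same token budget can be illustrated with simple arithmetic. The sketch below is not the IB-TPO sampler. It assumes a hypothetical two-level tree in which each group of trajectories shares one generated prefix, and the function names, trajectory length, prefix length, and group sizes are all illustrative choices rather than values from the paper. Under these assumptions, shared prefix tokens are generated only once, so the same budget covers more complete trajectories.

```python
def independent_token_cost(num_trajectories: int, traj_len: int) -> int:
    """Tokens generated when every trajectory is sampled from scratch."""
    return num_trajectories * traj_len


def tree_token_cost(num_groups: int, branches_per_group: int,
                    traj_len: int, shared_prefix_len: int) -> int:
    """Tokens generated by a hypothetical two-level tree.

    Each group generates one shared prefix, then branches into
    `branches_per_group` suffixes that complete the trajectory.
    """
    if not 0 <= shared_prefix_len <= traj_len:
        raise ValueError("shared_prefix_len must lie in [0, traj_len]")
    prefix_tokens = num_groups * shared_prefix_len
    suffix_tokens = num_groups * branches_per_group * (traj_len - shared_prefix_len)
    return prefix_tokens + suffix_tokens


if __name__ == "__main__":
    # Illustrative numbers only (assumptions, not taken from the paper).
    traj_len = 900
    flat = independent_token_cost(num_trajectories=8, traj_len=traj_len)
    tree = tree_token_cost(num_groups=4, branches_per_group=3,
                           traj_len=traj_len, shared_prefix_len=450)
    print(f"independent: 8 trajectories, {flat} tokens")
    print(f"tree:       12 trajectories, {tree} tokens")
```

With these illustrative settings, 12 tree-sampled trajectories cost the same number of generated tokens as 8 independent ones, which is 50% more trajectories for the same budget. The actual gain depends on where branches occur and how long the shared prefixes are.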