Long Live The Balance: Information Bottleneck Driven Tree-based Policy Optimization

May 27, 2026
Authors: Hao Jiang, Shurui Li, Tianpeng Bu, Bowen Xu, Xin Liu, Qihua Chen, Hongtao Duan, Lulu Hu, Bin Yang, Minying Zhang
cs.AI

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have shown promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck (IB) theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves online sampling efficiency, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments on standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
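
To give some intuition for why tree sampling can fit more trajectories into a fixed token budget, here is a toy calculation. It is only a sketch and not the paper's algorithm: the abstract does not describe the actual tree structure or branching rule. The sketch assumes every trajectory has the same length and that all tree-sampled trajectories branch from a single shared prefix that is generated once. The function names and all numbers are hypothetical.

```python
def independent_cost(n_traj: int, length: int) -> int:
    """Tokens generated when every trajectory is sampled from scratch."""
    return n_traj * length


def tree_cost(n_traj: int, length: int, shared_prefix: int) -> int:
    """Tokens generated when all trajectories branch from one shared prefix.

    Assumption (hypothetical): the prefix is generated once, and each
    trajectory then adds its own suffix of length - shared_prefix tokens.
    """
    if not 0 <= shared_prefix <= length:
        raise ValueError("shared_prefix must lie in [0, length]")
    return shared_prefix + n_traj * (length - shared_prefix)


def max_tree_trajectories(budget: int, length: int, shared_prefix: int) -> int:
    """Largest number of branching trajectories that fits in the budget."""
    suffix = length - shared_prefix
    if suffix <= 0:
        raise ValueError("shared_prefix must be shorter than length")
    if budget < shared_prefix:
        return 0
    return (budget - shared_prefix) // suffix


if __name__ == "__main__":
    # Made-up numbers: 8 independent rollouts of 100 tokens each.
    budget = independent_cost(n_traj=8, length=100)
    n_tree = max_tree_trajectories(budget, length=100, shared_prefix=40)
    print(f"budget={budget} tokens, tree trajectories={n_tree}")
```

Under these assumptions, the longer the shared prefix, the more trajectories fit into the same budget, because only the suffixes are paid for per trajectory. How much the paper's method actually gains depends on its real branching scheme, which this sketch does not model.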