Long Live the Balance: Information Bottleneck-Driven Tree-Based Policy Optimization

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy not only improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, but also reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method significantly outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
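
To give some intuition for why tree-based sampling can yield more trajectories under a fixed token budget, the following minimal sketch counts generated tokens when trajectories share prefixes. It is an illustration only: the function names, the equal-length segments, and the fixed branching schedule are all assumptions made here, not the IB-guided strategy of the paper, and the numbers are not meant to reproduce the reported 50% figure.

```python
from math import prod


def tree_token_cost(branching, segment_tokens):
    """Tokens generated for a sampling tree (hypothetical cost model).

    branching[i] is the number of children each node spawns at level i;
    every node corresponds to one reasoning segment of `segment_tokens`
    tokens. Shared prefixes are generated once, so each node is counted once.
    """
    nodes = 1  # the prompt/root, which costs no generated tokens
    total = 0
    for b in branching:
        nodes *= b                      # nodes at this level
        total += nodes * segment_tokens
    return total


def num_trajectories(branching):
    """Number of root-to-leaf trajectories in the tree."""
    return prod(branching)


def independent_token_cost(n_trajectories, depth, segment_tokens):
    """Tokens generated when every trajectory is sampled from scratch."""
    return n_trajectories * depth * segment_tokens


# Illustrative schedule (an assumption): branch at levels 1 and 3 only.
branching = [2, 1, 2, 1]
tree_cost = tree_token_cost(branching, segment_tokens=100)
flat_cost = independent_token_cost(
    num_trajectories(branching), depth=len(branching), segment_tokens=100
)
print(num_trajectories(branching), tree_cost, flat_cost)
```

In this toy setting, the tree produces the same number of complete trajectories with fewer generated tokens than independent sampling, so the saved budget could be spent on additional branches. How IB-TPO actually decides where to branch is governed by its IB-guided criterion, which this sketch does not model.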