Long Live the Balance: Information Bottleneck-driven Tree-based Policy Optimization

Abstract

Recent advances in online reinforcement learning (RL) for large language models (LLMs) have demonstrated promising performance on complex reasoning tasks. However, these methods often exhibit an imbalanced exploration-exploitation trade-off, resulting in unstable optimization and suboptimal performance. We introduce IB-Score, a novel metric grounded in Information Bottleneck theory that evaluates a policy's exploration-exploitation balance by quantifying the trade-off between step-level reasoning diversity and the mutual information shared with the correct answer. Analysis based on IB-Score shows that popular online RL approaches (e.g., GRPO) with common regularizers fail to maintain this balance consistently during training, which leads to suboptimal results. To address this, we propose Information Bottleneck-driven Tree-based Policy Optimization (IB-TPO), a principled framework that formulates IB-Score as a fine-grained optimization objective and uses a novel IB-guided tree sampling strategy. This strategy improves the efficiency of online sampling, yielding 50% more trajectories under the same token budget, and it reuses the tree structure for effective Monte Carlo estimation of IB-Score. Extensive experiments across standard benchmarks show that our method outperforms the GRPO baseline by 2.9% to 3.6% and also outperforms other state-of-the-art online RL approaches. Our code is available at https://github.com/alibaba/EfficientRL.
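The token-budget claim can be made concrete with a small back-of-the-envelope calculation. The sketch below is not the IB-TPO sampler. It only illustrates, under assumed numbers, why sharing reasoning prefixes in a tree can yield more complete trajectories than independent sampling for a comparable token cost. The segment length, tree depth, and branching factors are hypothetical values chosen for illustration, and the function names are not taken from the paper or its code.

```python
def independent_cost(num_trajectories, segments, segment_tokens):
    """Tokens generated when every trajectory is sampled from scratch."""
    return num_trajectories * segments * segment_tokens


def tree_cost(branching, segment_tokens):
    """Tokens generated when trajectories share prefixes in a tree.

    branching[d] is the (hypothetical) number of children sampled per node
    at depth d. Each node corresponds to one generated reasoning segment.
    Returns (total_tokens, number_of_leaf_trajectories).
    """
    total_segments = 0
    nodes_at_depth = 1
    for children in branching:
        nodes_at_depth *= children
        total_segments += nodes_at_depth
    return total_segments * segment_tokens, nodes_at_depth


SEGMENT_TOKENS = 100  # assumed tokens per reasoning segment

# Independent sampling: 8 trajectories, each made of 4 segments.
flat_tokens = independent_cost(8, 4, SEGMENT_TOKENS)

# Tree sampling: 4 levels, branching early and at the third step only.
tree_tokens, tree_trajectories = tree_cost([2, 2, 3, 1], SEGMENT_TOKENS)

print(f"independent: 8 trajectories, {flat_tokens} tokens")
print(f"tree: {tree_trajectories} trajectories, {tree_tokens} tokens")
```

With these assumed values, the tree produces 12 leaf trajectories for 3000 generated tokens, while independent sampling produces 8 trajectories for 3200 tokens. The actual gain depends on where and how widely the tree branches, which in IB-TPO is guided by IB-Score rather than fixed in advance.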