IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck
January 9, 2026
Authors: Huilin Deng, Hongchen Luo, Yue Zhu, Long Li, Zhuoyue Chen, Xinghao Zhao, Ming Li, Jihai Zhang, Mengchang Wang, Yang Cao, Yu Kang
cs.AI
Abstract
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.
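To make the two mechanisms named in the abstract concrete, below is a minimal, self-contained sketch of (1) branching a rollout at high-entropy decoding states and (2) scoring candidate trajectories with an information-bottleneck-style objective that trades task reward against trajectory length. The function names, the entropy threshold, and the reward-minus-complexity scoring form are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: entropy-triggered branching and an IB-style
# trajectory score. Names and thresholds are assumptions, not IIB-LPO itself.
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branch_points(step_probs: List[Sequence[float]], threshold: float = 1.0) -> List[int]:
    """Indices of decoding steps whose entropy exceeds the branching threshold.

    High-entropy steps are where the abstract says latent branching is triggered
    to diversify reasoning paths.
    """
    return [t for t, probs in enumerate(step_probs) if token_entropy(probs) > threshold]


def ib_score(reward: float, num_tokens: int, beta: float = 0.01) -> float:
    """IB-style trajectory score: verifiable reward (informativeness proxy)
    minus a length/complexity penalty, so concise-but-correct paths rank highest."""
    return reward - beta * num_tokens


if __name__ == "__main__":
    # Two decoding steps: one confident, one near-uniform (high entropy).
    steps = [[0.9, 0.05, 0.05], [0.26, 0.25, 0.25, 0.24]]
    print("branch at steps:", branch_points(steps))  # -> [1]

    # Filter candidate trajectories by IB score; keep the best one.
    candidates = [
        {"reward": 1.0, "len": 180},
        {"reward": 1.0, "len": 420},
        {"reward": 0.0, "len": 90},
    ]
    best = max(candidates, key=lambda c: ib_score(c["reward"], c["len"]))
    print("selected trajectory:", best)  # correct and shortest wins
```

Under these assumptions, the filter prefers the shorter of two correct trajectories, which mirrors the abstract's claim that the Information Bottleneck principle keeps exploration both concise and informative.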