DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off
April 15, 2026
Authors: Xiaofan Li, Ming Yang, Zhiyuan Ma, Shichao Ma, Jintao Du, Yu Cheng, Weiqiang Wang, Zhizhong Zhang, Xin Tan, Yanyun Qu, Lizhuang Ma, Yuan Xie
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has driven significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively balancing exploration and exploitation remains a critical challenge. In this paper, we analyze in depth the exploration-exploitation dilemma caused by extremely hard and extremely easy samples during training, and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity space disentangling strategy that divides the sample space into an exploration subspace (high perplexity) and an exploitation subspace (low perplexity), thereby mining the fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal interference to the verification rewards, enabling more stable policy optimization. We evaluate our method on two mainstream tasks, mathematical reasoning and function calling. Experimental results demonstrate the superiority of the proposed method and confirm that a fine-grained exploration-exploitation trade-off is effective in enhancing LLM performance.
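To make the perplexity-based split concrete, the following is a minimal sketch and not the authors' implementation. It uses the standard definition of perplexity (the exponential of the negative mean token log-probability) and partitions sampled responses into a high-perplexity and a low-perplexity group. The function names, the `threshold` parameter, and the median default are assumptions introduced for illustration only. The abstract does not specify how the subspaces are delimited, and the bidirectional reward allocation is not modeled here.

```python
import math
from statistics import median


def perplexity(token_logprobs):
    """Perplexity of one response: exp of the negative mean token log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def split_by_perplexity(responses, threshold=None):
    """Partition responses into exploration / exploitation index lists.

    responses: list of per-token log-probability lists, one per sampled response.
    threshold: hypothetical cutoff; defaults to the median perplexity of the
               batch (an assumption, not taken from the paper).
    """
    ppls = [perplexity(lp) for lp in responses]
    cut = median(ppls) if threshold is None else threshold
    # High perplexity -> exploration subspace; low perplexity -> exploitation subspace.
    explore = [i for i, p in enumerate(ppls) if p > cut]
    exploit = [i for i, p in enumerate(ppls) if p <= cut]
    return ppls, explore, exploit


# Toy log-probabilities for three sampled responses (made-up values).
batch = [[-0.1, -0.2], [-2.0, -1.0], [-0.5, -0.5]]
ppls, explore, exploit = split_by_perplexity(batch)
```

With the toy batch above, the second response has the highest perplexity and is the only one placed in the exploration group under the median cutoff. Any real use would need the cutoff rule and the reward adjustment described in the paper itself.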