Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
October 16, 2025
Authors: Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying
cs.AI
Abstract
Large language model (LLM)-based agents are increasingly trained with
reinforcement learning (RL) to enhance their ability to interact with external
environments through tool use, particularly in search-based settings that
require multi-turn reasoning and knowledge acquisition. However, existing
approaches typically rely on outcome-based rewards that are only provided at
the final answer. This reward sparsity becomes particularly problematic in
multi-turn settings, where long trajectories exacerbate two critical issues:
(i) advantage collapse, where all rollouts receive identical rewards and
provide no useful learning signals, and (ii) lack of fine-grained credit
assignment, where dependencies between turns are obscured, especially in
long-horizon tasks. In this paper, we propose Information Gain-based Policy
Optimization (IGPO), a simple yet effective RL framework that provides dense
and intrinsic supervision for multi-turn agent training. IGPO models each
interaction turn as an incremental process of acquiring information about the
ground truth, and defines turn-level rewards as the marginal increase in the
policy's probability of producing the correct answer. Unlike prior
process-level reward approaches that depend on external reward models or costly
Monte Carlo estimation, IGPO derives intrinsic rewards directly from the
model's own belief updates. These intrinsic turn-level rewards are combined
with outcome-level supervision to form dense reward trajectories. Extensive
experiments on both in-domain and out-of-domain benchmarks demonstrate that
IGPO consistently outperforms strong baselines in multi-turn scenarios,
achieving higher accuracy and improved sample efficiency.
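To make the reward shaping described in the abstract concrete, the sketch below computes turn-level rewards as the marginal increase in the policy's probability of the ground-truth answer and adds the outcome-level reward at the final turn. This is a minimal illustration based only on the abstract, not the authors' implementation: the function name, the assumption that the gold answer is re-scored under the policy after every interaction turn, and the `outcome_weight` parameter are all hypothetical.

```python
import torch


def igpo_style_rewards(answer_logprobs, outcome_reward, outcome_weight=1.0):
    """Hypothetical sketch of information-gain-based dense rewards.

    answer_logprobs: list of length T+1 with log P_theta(ground-truth answer | context),
        evaluated once before any tool call and again after each of the T turns,
        i.e. the policy's own belief about the correct answer as it updates.
    outcome_reward: scalar outcome-level reward for the final answer (e.g. exact match).
    Returns a length-T tensor of per-turn rewards forming a dense reward trajectory.
    """
    probs = torch.tensor(answer_logprobs).exp()
    # Information gain of turn t: marginal increase in the probability of the correct answer.
    gains = probs[1:] - probs[:-1]
    rewards = gains.clone()
    # Outcome-level supervision is added at the final turn (weighting is an assumption here).
    rewards[-1] = rewards[-1] + outcome_weight * outcome_reward
    return rewards


# Example: belief in the gold answer rises across three turns; the final answer is correct.
print(igpo_style_rewards([-3.0, -2.0, -1.5, -0.2], outcome_reward=1.0))
```

The key point the sketch tries to convey is that the turn-level signal is intrinsic: it comes from the policy's own belief updates about the ground truth, with no external reward model or Monte Carlo rollouts.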