Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents
October 16, 2025
Authors: Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying
cs.AI
Abstract
Large language model (LLM)-based agents are increasingly trained with
reinforcement learning (RL) to enhance their ability to interact with external
environments through tool use, particularly in search-based settings that
require multi-turn reasoning and knowledge acquisition. However, existing
approaches typically rely on outcome-based rewards that are only provided at
the final answer. This reward sparsity becomes particularly problematic in
multi-turn settings, where long trajectories exacerbate two critical issues:
(i) advantage collapse, where all rollouts receive identical rewards and
provide no useful learning signals, and (ii) lack of fine-grained credit
assignment, where dependencies between turns are obscured, especially in
long-horizon tasks. In this paper, we propose Information Gain-based Policy
Optimization (IGPO), a simple yet effective RL framework that provides dense
and intrinsic supervision for multi-turn agent training. IGPO models each
interaction turn as an incremental process of acquiring information about the
ground truth, and defines turn-level rewards as the marginal increase in the
policy's probability of producing the correct answer. Unlike prior
process-level reward approaches that depend on external reward models or costly
Monte Carlo estimation, IGPO derives intrinsic rewards directly from the
model's own belief updates. These intrinsic turn-level rewards are combined
with outcome-level supervision to form dense reward trajectories. Extensive
experiments on both in-domain and out-of-domain benchmarks demonstrate that
IGPO consistently outperforms strong baselines in multi-turn scenarios,
achieving higher accuracy and improved sample efficiency.
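To make the turn-level reward described above concrete, here is a minimal sketch (not the authors' implementation) of how the information-gain reward could be computed, assuming the policy's probability of the ground-truth answer has already been scored after each turn's observation is appended to the context. The function name `information_gain_rewards` and the parameters `answer_probs`, `outcome_reward`, and `outcome_weight` are illustrative assumptions, not from the paper.

```python
# Sketch of IGPO-style dense rewards: each turn's intrinsic reward is the
# marginal increase in the policy's probability of producing the gold answer,
# and the sparse outcome reward is added at the final turn.

from typing import List

def information_gain_rewards(
    answer_probs: List[float],   # p_0, p_1, ..., p_T: P(gold answer | prefix up to turn t)
    outcome_reward: float,       # outcome-level reward at the final answer (e.g. 0 or 1)
    outcome_weight: float = 1.0, # hypothetical weight balancing the two signals
) -> List[float]:
    """Return one dense reward per turn t = 1..T."""
    # Turn-level reward: change in the model's belief about the correct answer.
    gains = [
        answer_probs[t] - answer_probs[t - 1]
        for t in range(1, len(answer_probs))
    ]
    # Combine the intrinsic turn-level rewards with outcome-level supervision.
    gains[-1] += outcome_weight * outcome_reward
    return gains

# Example: belief in the gold answer rises as the agent retrieves evidence.
print(information_gain_rewards([0.05, 0.20, 0.55, 0.90], outcome_reward=1.0))
# -> approximately [0.15, 0.35, 1.35]
```

Because the reward is derived from the policy's own belief updates over the ground-truth answer, this kind of sketch needs no external reward model or Monte Carlo rollouts, which is the property the abstract highlights.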