InfoPO: Information-Driven Policy Optimization for User-Centric Agents
February 28, 2026
Authors: Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu
cs.AI
Abstract
Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
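The two mechanisms the abstract names can be sketched in a few lines. The following is a minimal, illustrative sketch only: the function names, the KL-divergence instantiation of the information-gain reward, and the `alpha` gating constant are all assumptions for exposition, not the paper's exact formulation.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def info_gain_reward(p_with_feedback, p_masked):
    """Per-turn information-gain reward (hypothetical instantiation):
    how much the user's actual feedback shifts the agent's next-action
    distribution relative to a counterfactual with masked feedback."""
    return kl_divergence(p_with_feedback, p_masked)

def variance_gated_fusion(info_rewards, outcome_rewards, alpha=0.5):
    """Blend turn-level info-gain rewards with trajectory outcomes.
    When outcome rewards vary across the rollout group (an informative
    advantage signal), the gate leans on the outcome; when they collapse
    to a near-constant, it falls back to the info-gain signal."""
    n = len(outcome_rewards)
    mean = sum(outcome_rewards) / n
    var = sum((r - mean) ** 2 for r in outcome_rewards) / n
    gate = var / (var + alpha)  # in [0, 1): high when outcomes vary
    return [gate * o + (1.0 - gate) * i
            for i, o in zip(info_rewards, outcome_rewards)]
```

Under this sketch, a turn whose feedback leaves the agent's action distribution unchanged earns zero information-gain reward, and a rollout group with identical outcome rewards (zero advantage under GRPO) is scored entirely by the turn-level signal.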