InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Abstract

Real-world user requests to LLM agents are often underspecified, so agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a finer granularity to drive more targeted learning. To this end, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction. InfoPO computes an information-gain reward that credits a turn when its feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion, identifying which information matters while keeping learning directed at the task goal. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
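
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the InfoPO repository. It assumes that the change in the action distribution is measured with a KL divergence over a small discrete action set, and that the gate shrinks as the variance of outcome rewards within a rollout group grows. The function names, the additive fusion, and the parameter `tau` are all hypothetical.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def info_gain_rewards(with_feedback, masked_feedback):
    """Per-turn reward: how far the next-action distribution moves when the
    turn's feedback is visible, compared with the masked-feedback
    counterfactual. KL is an assumed choice of distance."""
    return [kl_divergence(p, q) for p, q in zip(with_feedback, masked_feedback)]


def variance_gate(group_outcomes, tau=0.05):
    """Hypothetical gate in (0, 1]: equals 1 when every rollout in the group
    has the same outcome reward, and shrinks as outcome variance grows."""
    return tau / (tau + pvariance(group_outcomes))


def fused_turn_rewards(outcome, info_gains, gate):
    """Assumed fusion rule: trajectory outcome plus gated per-turn info gain."""
    return [outcome + gate * g for g in info_gains]


# Turn 1 feedback shifts the policy; turn 2 feedback changes nothing.
with_fb = [[0.9, 0.1], [0.5, 0.5]]
masked = [[0.5, 0.5], [0.5, 0.5]]
gains = info_gain_rewards(with_fb, masked)

# Every rollout in the group failed, so the outcome alone gives no signal.
gate = variance_gate([0.0, 0.0, 0.0, 0.0])
rewards = fused_turn_rewards(0.0, gains, gate)
```

In this toy case only the first turn receives a positive reward, even though the trajectory-level outcome is identical across the group. That is the situation in which a trajectory-level advantage carries no learning signal.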
