InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Abstract

Real-world user requests to LLM agents are often underspecified, so agents must interact with the user to acquire missing information before they can make correct downstream decisions. Current methods based on multi-turn GRPO, however, often compute rewards at the trajectory level, which causes credit-assignment problems and leaves rollout groups with insufficient advantage signal. One feasible remedy is to identify valuable interaction turns at fine granularity so that learning can be targeted at them. To this end, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction. InfoPO computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual. It then combines this signal with task outcomes through an adaptive variance-gated fusion, which identifies how important the information is while keeping the agent oriented toward the task goal. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It is also robust to shifts in the user simulator and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
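To make the two ingredients named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the change in the action distribution is measured with a KL divergence over a small discrete action set, and that the variance gate shrinks the information-gain term as the outcome rewards within a rollout group become more spread out. The function names, the gate formula, and the parameter `tau` are all hypothetical and chosen only for illustration; the abstract does not specify them.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def info_gain_reward(dist_with_feedback, dist_masked):
    """Hypothetical turn-level reward (an assumption, not the paper's formula).

    Measures how far the real feedback moved the agent's next-action
    distribution away from the masked-feedback counterfactual.
    """
    return kl_divergence(dist_with_feedback, dist_masked)


def variance_gate(outcome_rewards, tau=0.05):
    """Hypothetical gate in (0, 1] (an assumption, not the paper's formula).

    Equals 1 when all rollouts in a group received the same outcome reward
    (no outcome-based advantage signal) and decays as the variance grows.
    """
    return tau / (tau + pvariance(outcome_rewards))


def fuse(outcome_reward, turn_info_gains, gate):
    """Assumed fusion: per-turn reward = outcome reward + gated info gain."""
    return [outcome_reward + gate * g for g in turn_info_gains]


if __name__ == "__main__":
    # One rollout with two turns; the second turn's feedback shifts the policy.
    gains = [
        info_gain_reward([0.5, 0.5], [0.5, 0.5]),  # feedback changed nothing
        info_gain_reward([0.9, 0.1], [0.5, 0.5]),  # feedback was informative
    ]
    gate = variance_gate([1.0, 1.0, 1.0, 1.0])  # identical outcomes in group
    print(fuse(1.0, gains, gate))
```

Under these assumptions, a group whose rollouts all succeed or all fail still yields distinct per-turn rewards, because the gate stays open and the informative turn receives extra credit. This is the situation the abstract describes as insufficient advantage signal within rollout groups.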
