InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Abstract

Real-world user requests to LLM agents are often underspecified, so agents must interact to acquire the missing information before they can make correct downstream decisions. Current multi-turn GRPO-based methods, however, often compute rewards at the trajectory level, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A practical remedy is to identify valuable interaction turns at fine granularity so that learning can be targeted at them. To this end, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction. InfoPO computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution relative to a masked-feedback counterfactual. It then combines this signal with task outcomes through an adaptive variance-gated fusion, which identifies important information while keeping the policy oriented toward the task goal. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It is also robust to shifts in the user simulator and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
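The two ingredients named in the abstract can be illustrated with a small sketch. This is not the authors' implementation; the abstract does not give the exact formulas, so everything below is an assumption made for illustration. It assumes the information-gain reward is measured as a KL divergence between the agent's next-action distribution with the real feedback and with masked feedback, and that the gate weights this reward more strongly when task outcomes within a rollout group barely differ. The function names and the parameters `tau` and `beta` are hypothetical.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same action set."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def info_gain_rewards(with_feedback, masked_feedback):
    """One reward per turn: how far the real feedback moved the agent's
    next-action distribution away from the masked-feedback counterfactual.
    (Assumed form; the paper may use a different divergence.)"""
    return [kl_divergence(p, q) for p, q in zip(with_feedback, masked_feedback)]


def variance_gate(group_outcomes, tau=0.05):
    """Hypothetical gate in (0, 1]: equals 1 when all rollouts in the group
    share the same outcome, and shrinks as the outcome variance grows."""
    return tau / (tau + pvariance(group_outcomes))


def fuse(outcome_reward, turn_info_gains, gate, beta=1.0):
    """Per-turn reward: task outcome plus a gated information-gain bonus."""
    return [outcome_reward + beta * gate * g for g in turn_info_gains]


if __name__ == "__main__":
    # Turn 1: feedback sharpens the action distribution; turn 2: no change.
    gains = info_gain_rewards(
        with_feedback=[[0.9, 0.1], [0.5, 0.5]],
        masked_feedback=[[0.5, 0.5], [0.5, 0.5]],
    )
    gate = variance_gate([0, 0, 0, 0])  # every rollout in the group failed
    print(fuse(outcome_reward=0.0, turn_info_gains=gains, gate=gate))
```

Under these assumptions, a group in which every rollout receives the same outcome would carry no trajectory-level advantage signal, yet the first turn still earns a positive reward because its feedback changed what the agent did next.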
