
Agentic Policy Optimization via Instruction-Policy Co-Evolution

December 1, 2025
Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static, manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores interactions with the environment. To bridge this gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled alongside questions; reward signals in the RL loop are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, in which an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
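
The abstract describes the core loop: sample an instruction-question pair, attribute the verifiable reward back to the instruction, periodically prune weak instructions, and let an LLM-based optimizer propose replacements from a replay buffer. The sketch below illustrates one possible shape of that loop; the `policy.rollout`, `policy.update`, and `optimizer_llm.reflect` interfaces, along with all other names, are hypothetical stand-ins and not taken from the paper.

```python
# Minimal sketch of an instruction-policy co-evolution loop as summarized in
# the abstract. All interfaces (policy, optimizer_llm, rollout, reflect) are
# hypothetical illustrations, not the authors' implementation.
import random
from dataclasses import dataclass, field


@dataclass
class InstructionCandidate:
    text: str
    rewards: list = field(default_factory=list)

    @property
    def mean_reward(self) -> float:
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0


def co_evolve(policy, optimizer_llm, questions, seed_instructions,
              steps=1000, prune_every=100, pool_size=8):
    """Jointly update the policy and an instruction pool (illustrative only)."""
    pool = [InstructionCandidate(t) for t in seed_instructions]
    replay_buffer = []

    for step in range(steps):
        # 1. Pair a sampled instruction with a sampled question.
        instr = random.choice(pool)
        question = random.choice(questions)

        # 2. Roll out the current policy and obtain a verifiable reward.
        trajectory, reward = policy.rollout(instruction=instr.text, question=question)

        # 3. Attribute the reward to the instruction and store the experience.
        instr.rewards.append(reward)
        replay_buffer.append((instr.text, question, trajectory, reward))

        # 4. Standard RLVR-style policy update on the collected trajectory.
        policy.update(trajectory, reward)

        # 5. Periodically prune low performers and ask the LLM optimizer to
        #    reflect on recent experience and propose new instructions.
        if (step + 1) % prune_every == 0:
            pool.sort(key=lambda c: c.mean_reward, reverse=True)
            pool = pool[: pool_size // 2]
            new_texts = optimizer_llm.reflect(replay_buffer[-prune_every:],
                                              num_candidates=pool_size - len(pool))
            pool += [InstructionCandidate(t) for t in new_texts]

    return policy, max(pool, key=lambda c: c.mean_reward)
```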