Agentic Policy Optimization via Instruction-Policy Co-Evolution

December 1, 2025
Authors: Han Zhou, Xingchen Wan, Ivan Vulić, Anna Korhonen
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capability of large language models (LLMs), enabling autonomous agents that can conduct effective multi-turn and tool-integrated reasoning. While instructions serve as the primary protocol for defining agents, RLVR typically relies on static and manually designed instructions. However, those instructions may be suboptimal for the base model, and the optimal instruction may change as the agent's policy improves and explores the interaction with the environment. To bridge the gap, we introduce INSPO, a novel Instruction-Policy co-evolution framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop. INSPO maintains a dynamic population of instruction candidates that are sampled with questions, where reward signals in RL loops are automatically attributed to each instruction, and low performers are periodically pruned. New instructions are generated and verified through an on-policy reflection mechanism, where an LLM-based optimizer analyzes past experience from a replay buffer and evolves more effective strategies given the current policy. We conduct extensive experiments on multi-turn retrieval and reasoning tasks, demonstrating that INSPO substantially outperforms strong baselines relying on static instructions. INSPO discovers innovative instructions that guide the agent toward more strategic reasoning paths, achieving substantial performance gains with only a marginal increase in computational overhead.
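The abstract describes INSPO's loop only at a high level. As a rough illustration, the Python sketch below shows one way such an instruction-policy co-evolution loop could be organized: an instruction population with per-instruction reward attribution, periodic pruning of low performers, and an LLM-based reflection step that proposes new candidates from a replay buffer. All names and interfaces here (`InstructionPool`, `optimizer_llm.reflect`, `policy.rollout`, `environment.verify`) are placeholders assumed for illustration and are not taken from the paper.

```python
import random

# Hypothetical sketch of an instruction-policy co-evolution loop in the spirit
# of INSPO, reconstructed from the abstract; the update rules and interfaces
# below are assumptions, not the paper's implementation.

class InstructionPool:
    """Dynamic population of instruction candidates with per-instruction reward stats."""

    def __init__(self, seed_instructions):
        # Track a running reward estimate for each candidate instruction.
        self.stats = {inst: {"reward_sum": 0.0, "count": 0} for inst in seed_instructions}

    def sample(self):
        # Sample an instruction to pair with the next question (uniform here for simplicity).
        return random.choice(list(self.stats))

    def attribute(self, instruction, reward):
        # Credit the RL reward signal back to the instruction used in the rollout.
        s = self.stats[instruction]
        s["reward_sum"] += reward
        s["count"] += 1

    def prune_and_refill(self, optimizer_llm, replay_buffer, keep_k):
        # Periodically drop the lowest-performing instructions ...
        ranked = sorted(
            self.stats,
            key=lambda i: self.stats[i]["reward_sum"] / max(self.stats[i]["count"], 1),
            reverse=True,
        )
        self.stats = {i: self.stats[i] for i in ranked[:keep_k]}
        # ... and ask an LLM-based optimizer to reflect on recent on-policy
        # experience and propose new candidates (interface is assumed).
        for new_inst in optimizer_llm.reflect(replay_buffer, list(self.stats)):
            self.stats.setdefault(new_inst, {"reward_sum": 0.0, "count": 0})


def training_step(policy, pool, question, environment):
    # One RLVR rollout: sample an instruction, run the multi-turn / tool-using agent,
    # verify the answer, then attribute the verifiable reward to policy and instruction.
    instruction = pool.sample()
    trajectory = policy.rollout(instruction, question, environment)  # assumed agent interface
    reward = environment.verify(trajectory)                          # verifiable reward
    policy.update(trajectory, reward)                                # e.g. a policy-gradient step
    pool.attribute(instruction, reward)
    return trajectory, reward
```

In this reading, instruction optimization adds only the pool bookkeeping and an occasional reflection call on top of the existing RL loop, which is consistent with the abstract's claim of marginal computational overhead; the actual sampling, attribution, and pruning schedules are detailed in the paper itself.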