SePO: Self-Evolving Prompt Agent for System Prompt Optimization

Abstract

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside task agents' system prompts. SePO adopts a self-referential design: a single prompt agent improves both the task agents' system prompts and its own through an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, improving average accuracy by 4.49 percentage points over Manual-CoT. The prompt optimization skill acquired in pre-training also generalizes to tasks outside the pre-training mixture, rather than merely memorizing per-task prompts.
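
The self-referential search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how parents are selected, how prompts are rewritten, or how candidates are scored, so the names `Candidate`, `evolve`, `best`, `toy_rewrite`, and `toy_evaluate`, the score-weighted parent sampling, and the order of the two rewrite steps are all assumptions made for illustration. The two toy functions stand in for LLM calls and benchmark evaluation.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    agent_prompt: str  # system prompt of the prompt agent itself
    task_prompt: str   # system prompt handed to the task agent
    score: float       # task-agent score under task_prompt, assumed in [0, 1]


# Hypothetical interfaces (not from the source):
#   rewrite(agent_prompt, text) -> text revised by a prompt agent that is
#                                  itself instructed by agent_prompt
#   evaluate(task_prompt)       -> score of the task agent on held-out items
Rewrite = Callable[[str, str], str]
Evaluate = Callable[[str], float]


def evolve(seed_agent_prompt: str, seed_task_prompt: str, rewrite: Rewrite,
           evaluate: Evaluate, steps: int, rng: random.Random) -> List[Candidate]:
    """Grow an archive of candidates; nothing is discarded, so weaker
    entries remain available as stepping stones."""
    archive = [Candidate(seed_agent_prompt, seed_task_prompt,
                         evaluate(seed_task_prompt))]
    for _ in range(steps):
        # Assumption: sample a parent with probability increasing in score.
        weights = [c.score + 1e-3 for c in archive]
        parent = rng.choices(archive, weights=weights, k=1)[0]
        # Self-referential step: the prompt agent revises its own prompt.
        new_agent_prompt = rewrite(parent.agent_prompt, parent.agent_prompt)
        # The revised prompt agent then revises the task agent's prompt.
        new_task_prompt = rewrite(new_agent_prompt, parent.task_prompt)
        archive.append(Candidate(new_agent_prompt, new_task_prompt,
                                 evaluate(new_task_prompt)))
    return archive


def best(archive: List[Candidate]) -> Candidate:
    """Return the highest-scoring candidate in the archive."""
    return max(archive, key=lambda c: c.score)


def toy_rewrite(agent_prompt: str, text: str) -> str:
    """Stand-in for an LLM call; it simply tags the text as revised."""
    return f"{text} (revised)"


def toy_evaluate(task_prompt: str) -> float:
    """Stand-in for benchmark accuracy; longer prompts score higher."""
    return min(1.0, len(task_prompt) / 100.0)


if __name__ == "__main__":
    archive = evolve("You improve prompts.", "Solve the task.",
                     toy_rewrite, toy_evaluate, steps=5, rng=random.Random(0))
    print(best(archive))
```

Under these assumptions, the two training stages would correspond to two calls of `evolve`: a pre-training call whose `evaluate` averages scores over a multi-task pool, followed by a fine-tuning call on the target task seeded with the `agent_prompt` of the best pre-trained candidate.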