SePO: Self-Evolving Prompt Agent for System Prompt Optimization

Abstract

System prompt optimization improves agent behavior without modifying the underlying model, yielding human-readable, model-agnostic instructions. Existing methods build a prompt agent that refines task agents' system prompts, yet leave the prompt agent's own system prompt hand-engineered and fixed. We propose Self-Evolving Prompt Optimization (SePO), which treats the prompt agent's own system prompt as an optimization target alongside the task agents' system prompts. SePO is self-referential: a single prompt agent improves both the task agents' system prompts and its own, using an open-ended evolutionary search that maintains an archive of candidate prompts as stepping stones. Training proceeds in two stages: pre-training evolves the prompt agent on a multi-task pool, and fine-tuning then applies it to a specific target task. Across five benchmarks spanning math (AIME'25), abstract reasoning (ARC-AGI-1), graduate-level science (GPQA), code generation (MBPP), and logic puzzles (Sudoku), SePO consistently outperforms Manual-CoT, TextGrad, and MetaSPO, raising average accuracy by 4.49 percentage points over Manual-CoT. The prompt optimization skill acquired in pre-training also generalizes to tasks outside the pre-training mixture, rather than merely memorizing per-task prompts.
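To make the self-referential loop concrete, the following is a minimal sketch of an archive-based evolutionary search in which one prompt agent rewrites both a task agent's system prompt and its own. It is an illustration under stated assumptions, not the SePO implementation: the names `Candidate`, `evolve`, `toy_rewrite`, and `toy_evaluate` are hypothetical, uniform parent sampling is assumed for simplicity, and the LLM call and benchmark evaluation are replaced by toy stand-ins.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    agent_prompt: str  # system prompt of the prompt agent itself
    task_prompt: str   # system prompt of the task agent
    score: float       # task-agent score obtained with task_prompt


def evolve(
    seed: Candidate,
    rewrite: Callable[[str, str], str],  # (prompt agent's system prompt, text to improve) -> new text
    evaluate: Callable[[str], float],    # task prompt -> score
    steps: int,
    rng: random.Random,
) -> List[Candidate]:
    """Grow an archive of candidates; every entry stays available as a stepping stone."""
    archive = [seed]
    for _ in range(steps):
        # Assumption: parents are sampled uniformly from the archive.
        parent = rng.choice(archive)
        # The prompt agent, steered by its own system prompt, rewrites the task prompt.
        child_task = rewrite(parent.agent_prompt, parent.task_prompt)
        # Self-referential step: the same agent rewrites its own system prompt.
        child_agent = rewrite(parent.agent_prompt, parent.agent_prompt)
        archive.append(Candidate(child_agent, child_task, evaluate(child_task)))
    return archive


# Toy stand-ins (assumptions): a real system would call an LLM and a benchmark here.
def toy_rewrite(agent_prompt: str, text: str) -> str:
    return text + " Think step by step."


def toy_evaluate(task_prompt: str) -> float:
    return float(task_prompt.count("step by step"))


seed = Candidate("You improve system prompts.", "You are a helpful solver.", 0.0)
archive = evolve(seed, toy_rewrite, toy_evaluate, steps=5, rng=random.Random(0))
best = max(archive, key=lambda c: c.score)
```

In this sketch the two training stages would differ only in what `evaluate` measures: an aggregate over a multi-task pool during pre-training, and a single target task during fine-tuning, with the evolved `agent_prompt` carried over as the new seed.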