p1: Better Prompt Optimization with Fewer Prompts
April 9, 2026
Authors: Zhaolin Gao, Yu Wang, Bo Liu, Thorsten Joachims, Kianté Brantley, Wen Sun
cs.AI
Abstract
Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates the variance of the system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose p1, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system optimization easier. Experiments on reasoning benchmarks show that p1 substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
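The filtering idea behind p1 can be sketched in a few lines: score each user prompt by the variance of rewards it induces across candidate system prompts, and keep only the most discriminative ones. The function name, the reward matrix, and the choice of `k` below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_high_variance_prompts(rewards, k=2):
    """Hypothetical sketch of the p1-style filtering step.

    rewards: array of shape (n_user_prompts, n_system_prompts), where
    rewards[i, j] is the (mean) reward of candidate system prompt j on
    user prompt i. Returns the indices of the k user prompts whose
    rewards vary most across system prompts -- the prompts best able to
    distinguish a good system prompt from a bad one.
    """
    # Variance across candidate system prompts, computed per user prompt.
    variance = rewards.var(axis=1)
    # Indices sorted by decreasing variance; keep the top k.
    return np.argsort(variance)[::-1][:k]

# Toy example: 4 user prompts x 3 candidate system prompts.
rewards = np.array([
    [0.90, 0.10, 0.50],  # highly discriminative user prompt
    [0.50, 0.50, 0.50],  # uninformative: same reward under every system prompt
    [0.80, 0.20, 0.60],  # discriminative
    [0.52, 0.48, 0.50],  # nearly uninformative
])
print(select_high_variance_prompts(rewards, k=2))  # -> [0 2]
```

On a heterogeneous dataset, prompts like rows 1 and 3 dominate the average and wash out the signal; filtering them away is exactly what makes the remaining small subset useful for optimization.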