p1: Better Prompt Optimization with Fewer Prompts

Abstract

Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when the variance among system prompts is sufficiently large, but fails when the variance among responses dominates it. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing the variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose p1, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset allows one to distinguish a good system prompt from a bad one, making system prompt optimization easier. Experiments on reasoning benchmarks show that p1 substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
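The variance decomposition and the filtering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `decompose_variance` and `select_user_prompts`, the reward array layout (user prompts x system prompts x sampled responses), and the top-k selection rule are all assumptions made for illustration, since the abstract does not state the exact scoring rule. The decomposition itself is the standard law of total variance with equal group sizes.

```python
import numpy as np


def decompose_variance(rewards):
    """Split reward variance for one user prompt into two parts.

    rewards: array of shape (S, R), with S candidate system prompts and
    R sampled responses per system prompt (hypothetical layout).
    Returns (among_system_prompts, among_responses, total).
    Population variances (ddof=0) are used so the two parts sum to the total.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Variance of the per-system-prompt mean reward: differences in prompt quality.
    among_system_prompts = rewards.mean(axis=1).var()
    # Average variance within each system prompt: generation stochasticity.
    among_responses = rewards.var(axis=1).mean()
    total = rewards.var()
    return among_system_prompts, among_responses, total


def select_user_prompts(rewards, k):
    """Pick the k user prompts whose mean reward varies most across system prompts.

    rewards: array of shape (U, S, R) for U user prompts (hypothetical layout).
    Returns (selected indices, per-user-prompt scores).
    The top-k rule is an assumption, not necessarily the rule used by p1.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean_reward = rewards.mean(axis=2)   # shape (U, S)
    scores = mean_reward.var(axis=1)     # shape (U,)
    order = np.argsort(-scores, kind="stable")
    return order[:k].tolist(), scores


# Toy rewards: 3 user prompts, 2 candidate system prompts, 2 responses each.
toy_rewards = np.array([
    [[1, 1], [0, 0]],   # system prompts clearly differ
    [[1, 0], [0, 1]],   # only response noise, no difference between prompts
    [[1, 1], [1, 0]],   # mild difference between prompts
])
selected, scores = select_user_prompts(toy_rewards, k=2)
```

In this toy data the second user prompt carries no signal about which system prompt is better, because all of its variance comes from response sampling, so a filter of this kind would drop it.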