p1: Better Prompt Optimization with Fewer Prompts

Abstract

Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization succeeds when variance among system prompts is sufficiently large, but fails when variance among responses dominates variance among system prompts. Surprisingly, we further show that scaling to more user prompts can hurt optimization by reducing variance among system prompts, especially on heterogeneous datasets where different user prompts favor different system prompts. Motivated by this insight, we propose p1, a simple user prompt filtering method that selects a small subset of user prompts with high variance across candidate system prompts. This subset of user prompts allows one to distinguish a good system prompt from a bad one, making system prompt optimization easier. Experiments on reasoning benchmarks show that p1 substantially improves prompt optimization over training on the full dataset and outperforms strong baselines such as GEPA. Notably, training on only two prompts from AIME 24 yields a system prompt that generalizes well to other reasoning benchmarks.
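
The decomposition and the filtering step described above can be illustrated with a minimal sketch. The abstract does not give the exact estimator, so this assumes the standard law of total variance (variance of per-system-prompt mean rewards plus mean of per-system-prompt response variances) and a plain top-k selection rule. The function names `decompose_reward_variance` and `select_user_prompts` are hypothetical and are not taken from the paper.

```python
from statistics import fmean, pvariance


def decompose_reward_variance(rewards):
    """Split the reward variance for one user prompt into two parts.

    rewards[s][r] is the reward of response r sampled under candidate
    system prompt s. Assumption: every system prompt has the same number
    of sampled responses, so the two parts sum to the pooled variance.
    Returns (variance among system prompts, variance among responses).
    """
    per_prompt_means = [fmean(row) for row in rewards]
    # Spread of mean reward across system prompts (prompt quality signal).
    among_system_prompts = pvariance(per_prompt_means)
    # Average spread of rewards within one system prompt (sampling noise).
    among_responses = fmean([pvariance(row) for row in rewards])
    return among_system_prompts, among_responses


def select_user_prompts(rewards_by_user, k):
    """Return indices of the k user prompts with the largest variance
    among system prompts. Top-k ranking is an assumed selection rule.

    rewards_by_user[u][s][r] is the reward for user prompt u, candidate
    system prompt s, and sampled response r.
    """
    scores = [decompose_reward_variance(r)[0] for r in rewards_by_user]
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(scores)), key=lambda u: scores[u], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    # Three user prompts, two candidate system prompts, two responses each.
    rewards_by_user = [
        [[1, 1], [1, 1]],  # solved under both system prompts: no signal
        [[1, 1], [0, 0]],  # only the first system prompt works: high signal
        [[1, 0], [0, 1]],  # outcome is pure sampling noise: no signal
    ]
    print(select_user_prompts(rewards_by_user, k=1))
```

In this toy input only the second user prompt separates the two system prompts, so it is the one a filter of this kind would keep. The third user prompt shows the failure case from the abstract: all of its variance is among responses, so it carries no information about which system prompt is better.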
