Prompt Engineering a Prompt Engineer
November 9, 2023
Authors: Qinyuan Ye, Maxamed Axmed, Reid Pryzant, Fereshte Khani
cs.AI
Abstract
Prompt engineering is a challenging yet crucial task for optimizing the
performance of large language models (LLMs). It requires complex reasoning to
examine the model's errors, hypothesize what is missing or misleading in the
current prompt, and communicate the task with clarity. While recent works
indicate that LLMs can be meta-prompted to perform automatic prompt
engineering, their potential may not be fully realized because the meta-prompt
provides insufficient guidance to elicit the LLMs' complex reasoning
capabilities. In this work, we investigate the problem of "prompt engineering a
prompt engineer" -- constructing a meta-prompt that more effectively guides
LLMs to perform automatic prompt engineering. We introduce and analyze key
components, such as a step-by-step reasoning template and context
specification, which lead to improved performance. In addition, inspired by
common optimization concepts such as batch size, step size and momentum, we
introduce their verbalized counterparts to the meta-prompt and investigate
their effects. Our final method, named PE2, finds a prompt that outperforms
"let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the
GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction
Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world
industrial prompt. In these settings, PE2 achieves strong performance and
outperforms prior automatic prompt engineering baselines. Further, we show that
PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete
prompts, and demonstrates non-trivial counterfactual reasoning abilities.
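
The meta-prompting loop the abstract describes can be illustrated in a few lines. The sketch below is an assumption-laden illustration, not the authors' released PE2 implementation: the `llm` wrapper, the meta-prompt wording, and the `batch_size`/`step_size` parameters are hypothetical stand-ins for the verbalized optimization concepts mentioned above.

```python
# Minimal sketch of an automatic prompt-engineering loop in the spirit of PE2.
# Everything here (llm, META_PROMPT, hyperparameter names) is illustrative.

import random

def llm(prompt: str) -> str:
    """Hypothetical wrapper around a chat LLM; swap in your own client."""
    raise NotImplementedError

META_PROMPT = """\
You are helping to improve a task prompt.

Current prompt:
{prompt}

The prompt produced these errors (input / model answer / expected answer):
{errors}

Step by step:
1. For each error, explain why the current prompt led the model astray.
2. Hypothesize what is missing or misleading in the prompt.
3. Propose an edited prompt, making at most {step_size} edits ("step size").

Return only the edited prompt."""

def run_task(prompt: str, x: str) -> str:
    """Apply the task prompt to one input and return the model's answer."""
    return llm(f"{prompt}\n\nInput: {x}\nAnswer:")

def refine(prompt, dataset, batch_size=2, step_size=1, steps=5):
    """Iteratively edit `prompt` using failures on sampled mini-batches."""
    for _ in range(steps):
        batch = random.sample(dataset, batch_size)  # verbalized "batch size"
        errors = [(x, y_hat, y) for x, y in batch
                  if (y_hat := run_task(prompt, x)).strip() != y]
        if not errors:
            continue  # no failures in this batch; keep the prompt as-is
        error_text = "\n".join(f"- {x!r} -> {y_hat!r} (expected {y!r})"
                               for x, y_hat, y in errors)
        prompt = llm(META_PROMPT.format(prompt=prompt,
                                        errors=error_text,
                                        step_size=step_size))
    return prompt
```

In the abstract's framing, momentum could analogously be verbalized by including a summary of past edits in the meta-prompt; the sketch omits this for brevity.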