Prompt Engineering a Prompt Engineer
November 9, 2023
Authors: Qinyuan Ye, Maxamed Axmed, Reid Pryzant, Fereshte Khani
cs.AI
Abstract
Prompt engineering is a challenging yet crucial task for optimizing the
performance of large language models (LLMs). It requires complex reasoning to
examine the model's errors, hypothesize what is missing or misleading in the
current prompt, and communicate the task with clarity. While recent works
indicate that LLMs can be meta-prompted to perform automatic prompt
engineering, their potential may not be fully realized due to the lack of
sufficient guidance in the meta-prompt to elicit the LLMs' complex reasoning
capabilities. In this work, we investigate the problem of "prompt engineering a
prompt engineer" -- constructing a meta-prompt that more effectively guides
LLMs to perform automatic prompt engineering. We introduce and analyze key
components, such as a step-by-step reasoning template and context
specification, which lead to improved performance. In addition, inspired by
common optimization concepts such as batch size, step size and momentum, we
introduce their verbalized counterparts to the meta-prompt and investigate
their effects. Our final method, named PE2, finds a prompt that outperforms
"let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the
GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction
Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world
industrial prompt. In these settings, PE2 achieves strong performance and
outperforms prior automatic prompt engineering baselines. Further, we show that
PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete
prompts, and exhibits non-trivial counterfactual reasoning abilities.
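
For concreteness, the loop the abstract describes can be sketched in a few lines of Python. Everything below is an illustrative assumption, not the paper's actual PE2 template: `query_llm` is a hypothetical stand-in for any chat-completion API, and the meta-prompt wording is paraphrased. The verbalized optimization concepts map onto code as follows: batch size is the number of failing examples the meta-prompt inspects per step, step size caps how much of the prompt may change in one edit, and momentum is a summary of recent edits carried into the next step.

```python
import random

def query_llm(prompt: str) -> str:
    """Hypothetical helper; plug in your LLM API call here."""
    raise NotImplementedError

# Illustrative meta-prompt with a step-by-step reasoning template and an
# explicit context specification (paraphrased, not the paper's template).
META_PROMPT = """\
You are improving a task prompt.

## Task context
{context}

## Current prompt
{prompt}

## Failing examples (input / model output / expected output)
{failures}

## Recent edit history
{history}

Reason step by step:
1. For each failing example, explain why the current prompt led to the error.
2. Hypothesize what is missing or misleading in the prompt.
3. Propose an edited prompt that fixes these issues.
Change at most {step_size} phrases. Output only the new prompt.
"""

def refine_prompt(prompt, failing_examples, context, history,
                  batch_size=3, step_size=2):
    # "Batch size": how many failing examples the meta-prompt sees per step.
    batch = random.sample(failing_examples, min(batch_size, len(failing_examples)))
    failures = "\n".join(
        f"- {ex['input']} / {ex['output']} / {ex['expected']}" for ex in batch
    )
    meta = META_PROMPT.format(
        context=context,
        prompt=prompt,
        failures=failures,
        history="\n".join(history[-3:]) or "(none)",  # "momentum"
        step_size=step_size,                          # "step size"
    )
    new_prompt = query_llm(meta)
    history.append(f"{prompt!r} -> {new_prompt!r}")
    return new_prompt
```

In a full run, `refine_prompt` would be called for several iterations, scoring each candidate prompt on a held-out dev set after every edit and keeping the best-scoring one, as in prior automatic prompt engineering pipelines.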