Prompt Engineering a Prompt Engineer

Abstract

Prompt engineering is a challenging yet crucial task for optimizing the performance of large language models (LLMs). It requires complex reasoning to examine the model's errors, hypothesize what is missing or misleading in the current prompt, and communicate the task with clarity. While recent work indicates that LLMs can be meta-prompted to perform automatic prompt engineering, their potential may not be fully tapped, because the meta-prompt lacks sufficient guidance to elicit complex reasoning capabilities in LLMs. In this work, we investigate the problem of "prompt engineering a prompt engineer", that is, constructing a meta-prompt that more effectively guides LLMs to perform automatic prompt engineering. We introduce and analyze key components, such as a step-by-step reasoning template and context specification, which lead to improved performance. In addition, inspired by common optimization concepts such as batch size, step size, and momentum, we introduce their verbalized counterparts to the meta-prompt and investigate their effects. Our final method, named PE2, finds a prompt that outperforms "let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world industrial prompt. In these settings, PE2 achieves strong performance and outperforms prior automatic prompt engineering baselines. Further, we show that PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete prompts, and presents non-trivial counterfactual reasoning abilities.
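
To make the components named in the abstract more concrete, the following is a minimal sketch of how a meta-prompt might be assembled from a context specification, a step-by-step reasoning template, and verbalized counterparts of batch size, step size, and momentum. It is not PE2's actual meta-prompt: the names `FailureExample`, `MetaPromptConfig`, and `build_meta_prompt`, as well as all of the prompt wording, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureExample:
    """One model error shown to the prompt-proposing LLM (hypothetical structure)."""
    task_input: str
    model_output: str
    expected_output: str


@dataclass
class MetaPromptConfig:
    """Verbalized optimization settings (illustrative names and defaults)."""
    batch_size: int = 2  # number of failure examples included per step
    step_size: int = 10  # limit, stated in words, on how much the prompt may change
    history: List[str] = field(default_factory=list)  # earlier prompts, a momentum analogue


def build_meta_prompt(current_prompt: str,
                      failures: List[FailureExample],
                      config: MetaPromptConfig) -> str:
    """Assemble a meta-prompt string; the wording is an assumption, not PE2's text."""
    parts = [
        # Context specification: explain what a prompt is and how it is used.
        "A prompt is text placed before the task input to instruct a language model.",
        f"Current prompt:\n{current_prompt}",
    ]
    if config.history:
        # Momentum analogue: show previously tried prompts.
        tried = "\n".join(f"- {p}" for p in config.history)
        parts.append(f"Previously tried prompts:\n{tried}")
    # Batch size analogue: include only the first `batch_size` failures.
    for i, ex in enumerate(failures[:config.batch_size], start=1):
        parts.append(
            f"Example {i}\n"
            f"Input: {ex.task_input}\n"
            f"Model output: {ex.model_output}\n"
            f"Expected output: {ex.expected_output}"
        )
    # Step-by-step reasoning template.
    parts.append(
        "For each example, reason step by step:\n"
        "1. Is the model output correct?\n"
        "2. Does the current prompt describe the task accurately?\n"
        "3. What is missing or misleading in the current prompt?"
    )
    # Step size analogue: bound the size of the edit in natural language.
    parts.append(f"Then propose a new prompt, changing at most {config.step_size} words.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = build_meta_prompt(
        "Let's think step by step.",
        [FailureExample("12 * 3 = ?", "35", "36")],
        MetaPromptConfig(batch_size=1, step_size=5),
    )
    print(demo)
```

The returned string would be sent to an LLM, and its reply parsed for the proposed prompt; both steps are omitted here because they depend on the model API in use.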
