Prompt Engineering a Prompt Engineer

Abstract

Prompt engineering is a challenging yet crucial task for optimizing the performance of large language models (LLMs). It requires complex reasoning to examine the model's errors, hypothesize what is missing or misleading in the current prompt, and communicate the task with clarity. While recent works indicate that LLMs can be meta-prompted to perform automatic prompt engineering, their potential may not be fully tapped because the meta-prompt lacks sufficient guidance to elicit complex reasoning capabilities in LLMs. In this work, we investigate the problem of "prompt engineering a prompt engineer": constructing a meta-prompt that more effectively guides LLMs to perform automatic prompt engineering. We introduce and analyze key components, such as a step-by-step reasoning template and context specification, which lead to improved performance. In addition, inspired by common optimization concepts such as batch size, step size, and momentum, we introduce their verbalized counterparts to the meta-prompt and investigate their effects. Our final method, named PE2, finds a prompt that outperforms "let's think step by step" by 6.3% on the MultiArith dataset and 3.1% on the GSM8K dataset. To demonstrate its versatility, we apply PE2 to the Instruction Induction benchmark, a suite of counterfactual tasks, and a lengthy, real-world industrial prompt. In these settings, PE2 achieves strong performance and outperforms prior automatic prompt engineering baselines. Further, we show that PE2 makes meaningful and targeted prompt edits, amends erroneous or incomplete prompts, and exhibits non-trivial counterfactual reasoning abilities.
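
To make the idea of "verbalized" optimization concepts concrete, the following is a minimal sketch of how a meta-prompt might be assembled. It is not the PE2 meta-prompt itself: the wording of the template, the `MetaPromptConfig` and `build_meta_prompt` names, and the `(input, model output, expected output)` record format are all assumptions made for illustration. The sketch only assumes that batch size maps to the number of failure examples shown, step size to a limit on how many words may be changed, and momentum to how many earlier prompts are shown as history.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (input, model output, expected output); a hypothetical record format.
Failure = Tuple[str, str, str]


@dataclass
class MetaPromptConfig:
    # All three fields are illustrative stand-ins, not values from the paper.
    batch_size: int = 2   # failure examples shown per optimization step
    step_size: int = 50   # maximum number of words the model may change
    momentum: int = 2     # number of earlier prompts shown as history


def build_meta_prompt(
    current_prompt: str,
    failures: List[Failure],
    history: List[str],
    config: MetaPromptConfig,
) -> str:
    """Assemble an illustrative meta-prompt as plain text."""
    batch = failures[: config.batch_size]
    # Guard against momentum == 0, since history[-0:] would return everything.
    recent = history[-config.momentum:] if config.momentum > 0 else []

    parts = [
        "You are refining a prompt for a language model.",
        f"Current prompt:\n{current_prompt}",
    ]
    if recent:
        listed = "\n".join(f"- {p}" for p in recent)
        parts.append(f"Earlier prompts, oldest first:\n{listed}")
    for i, (question, got, expected) in enumerate(batch, start=1):
        parts.append(
            f"Example {i}\nInput: {question}\n"
            f"Model output: {got}\nExpected output: {expected}"
        )
    # Hypothetical step-by-step reasoning template; the last step carries
    # the verbalized step size.
    parts.append(
        "Reason step by step:\n"
        "1. For each example, is the model output correct?\n"
        "2. Does the current prompt describe the task accurately?\n"
        "3. What is missing or misleading in the current prompt?\n"
        f"4. Propose a new prompt, changing at most {config.step_size} words."
    )
    return "\n\n".join(parts)
```

The returned string would then be sent to an LLM acting as the prompt proposer; that call is left out here because it depends on the model API in use.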
