Reflective Prompt Tuning via Language Model Function Calling

Abstract

Large language models (LLMs) have become increasingly capable of following instructions and performing complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, which limits their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over the entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in both diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state-of-the-art methods, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.
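The diagnose-then-revise loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `DiagnosticReport`, `diagnose`, `reflective_prompt_tuning`, the `model` and `revise` callables, and the `ece_weight` parameter are all hypothetical. The sketch assumes the target model returns an answer together with a confidence in [0, 1], uses standard binned expected calibration error (ECE) as the calibration signal, lists raw failures rather than having an LLM summarize failure modes, and selects the final prompt by accuracy minus a weighted ECE, a selection rule chosen here only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (question, gold answer)
# Hypothetical target-model interface: (prompt, question) -> (answer, confidence)
TargetModel = Callable[[str, str], Tuple[str, float]]


@dataclass
class DiagnosticReport:
    """Structured result of evaluating one prompt on the full optimization set."""
    prompt: str
    accuracy: float
    ece: float
    failures: List[str]  # raw failure records; a real system might summarize them


def expected_calibration_error(confs: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Binned ECE: bin-size-weighted mean of |accuracy - mean confidence|."""
    total = len(confs)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are (lo, hi]; a confidence of exactly 0.0 goes into the first bin.
        idx = [i for i, c in enumerate(confs)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        avg_conf = sum(confs[i] for i in idx) / len(idx)
        ece += len(idx) / total * abs(acc - avg_conf)
    return ece


def diagnose(prompt: str, model: TargetModel,
             dataset: Sequence[Example]) -> DiagnosticReport:
    """Diagnostic function: evaluate the prompt on every example in the set."""
    confs, correct, failures = [], [], []
    for question, gold in dataset:
        answer, conf = model(prompt, question)
        ok = answer.strip() == gold.strip()
        confs.append(conf)
        correct.append(ok)
        if not ok:
            failures.append(f"Q: {question} | predicted: {answer} | expected: {gold}")
    accuracy = sum(correct) / len(correct)
    return DiagnosticReport(prompt, accuracy,
                            expected_calibration_error(confs, correct), failures)


def reflective_prompt_tuning(
    initial_prompt: str,
    model: TargetModel,
    revise: Callable[[DiagnosticReport, List[DiagnosticReport]], str],
    dataset: Sequence[Example],
    n_iters: int = 3,
    ece_weight: float = 0.5,  # assumed trade-off between accuracy and calibration
) -> DiagnosticReport:
    """Iterate diagnose -> revise, keeping a memory of all prior reports."""
    memory: List[DiagnosticReport] = []
    prompt = initial_prompt
    for _ in range(n_iters):
        report = diagnose(prompt, model, dataset)
        # The optimizer sees the current report plus the memory of earlier ones.
        prompt = revise(report, list(memory))
        memory.append(report)
    memory.append(diagnose(prompt, model, dataset))  # score the last revision too
    # Confidence-aware selection (assumed rule): reward accuracy, penalize ECE.
    return max(memory, key=lambda r: r.accuracy - ece_weight * r.ece)
```

In this sketch, `revise` stands in for the LLM optimizer: in a function-calling setup, the optimizer would invoke `diagnose` as a tool and produce the revised prompt itself, whereas here the loop is driven from plain Python so that it can run without any model API.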