Reflective Prompt Tuning via Language Model Function Calling

Abstract

Large language models (LLMs) have become increasingly capable of following instructions and performing complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. Existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, which limits their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state-of-the-art methods, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in both task performance and calibration.
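
The diagnose-then-revise loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the names `Report`, `diagnose`, `optimize`, `toy_target`, and `toy_revise` are hypothetical, the target model and the optimizer are replaced by toy stand-ins instead of LLM calls, the failure summary is just a list of missed questions, calibration is measured with standard binned expected calibration error (ECE), and the final selection rule (accuracy minus a weighted ECE) is one plausible way to use a calibration signal, not one the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# NOTE: all names below are hypothetical; the abstract does not define an API.
Target = Callable[[str, str], Tuple[str, float]]  # (prompt, question) -> (answer, confidence)

@dataclass
class Report:
    prompt: str
    accuracy: float
    ece: float
    failures: List[str]  # stand-in for an LLM-written summary of failure modes

def expected_calibration_error(confs: List[float], correct: List[bool], n_bins: int = 10) -> float:
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin."""
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confs) if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confs[i] for i in idx) / len(idx)
        ece += len(idx) / len(confs) * abs(acc - conf)
    return ece

def diagnose(prompt: str, target: Target, opt_set: List[Tuple[str, str]]) -> Report:
    """Evaluate the target over the whole optimization set and return a structured report."""
    confs, correct, failures = [], [], []
    for question, gold in opt_set:
        answer, conf = target(prompt, question)
        ok = answer == gold
        confs.append(conf)
        correct.append(ok)
        if not ok:
            failures.append(question)
    accuracy = sum(correct) / len(correct)
    return Report(prompt, accuracy, expected_calibration_error(confs, correct), failures)

def optimize(prompt: str, target: Target,
             revise: Callable[[str, Report, List[Report]], str],
             opt_set: List[Tuple[str, str]],
             n_iters: int = 2, ece_weight: float = 0.5) -> Report:
    """Iterate diagnose -> revise with a memory of reports, then pick the best prompt."""
    memory: List[Report] = []
    for _ in range(n_iters):
        report = diagnose(prompt, target, opt_set)
        memory.append(report)
        prompt = revise(prompt, report, memory)
    memory.append(diagnose(prompt, target, opt_set))  # also score the last revision
    # Assumed selection rule: trade accuracy off against calibration error.
    return max(memory, key=lambda r: r.accuracy - ece_weight * r.ece)

# Toy stand-ins so the sketch runs without any model access.
ANSWERS = {"1+1": "2", "2+2": "4"}

def toy_target(prompt: str, question: str) -> Tuple[str, float]:
    # Pretend the model only answers correctly when told to reason step by step.
    answer = ANSWERS[question] if "step by step" in prompt else "?"
    return answer, 0.9

def toy_revise(prompt: str, report: Report, memory: List[Report]) -> str:
    # Pretend the optimizer reads the report and makes one targeted edit.
    return prompt + " Think step by step." if report.failures else prompt

best = optimize("Answer the question.", toy_target, toy_revise, list(ANSWERS.items()))
```

In this toy run the first report records that every question failed, the revision adds a single instruction, and the selection step prefers the revised prompt because it is both more accurate and better calibrated. In a real system `toy_target` and `toy_revise` would be model calls, with `diagnose` exposed to the optimizer as a callable function.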