Reflective Prompt Tuning via Language Model Function Calling

Abstract

Large language models (LLMs) have become increasingly capable of following instructions and performing complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, which motivates automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods typically search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, which limits their ability to capture systematic error patterns and to make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of a human prompt engineer. An LLM optimizer calls a diagnostic function that evaluates the target model over the entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in both the diagnostic feedback and the final prompt selection. Across three reasoning tasks, RPT improves over the initial prompts by up to 12.9 percentage points, remains competitive with state-of-the-art methods, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, where it produces targeted prompt revisions that align with the diagnosed failure patterns and yield gains in both task performance and calibration.
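The diagnose-remember-revise loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: every name in it (`DiagnosticReport`, `diagnose`, `reflective_prompt_tuning`, `toy_target`, `toy_revise`) is hypothetical. The LLM optimizer and its function call are replaced by plain Python callables. The calibration signal is assumed to be the gap between mean confidence and accuracy, and the final selection rule (accuracy minus a weighted gap) is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (question, gold answer)
# Hypothetical target-model interface: (prompt, question) -> (answer, confidence in [0, 1])
TargetModel = Callable[[str, str], Tuple[str, float]]


@dataclass
class DiagnosticReport:
    prompt: str
    accuracy: float
    calibration_gap: float  # |mean confidence - accuracy|, an assumed calibration signal
    failures: List[str] = field(default_factory=list)


def diagnose(prompt: str, target: TargetModel, opt_set: List[Example]) -> DiagnosticReport:
    """Evaluate the prompt on the whole optimization set and collect failures."""
    correct, confidences, failures = 0, [], []
    for question, gold in opt_set:
        answer, confidence = target(prompt, question)
        confidences.append(confidence)
        if answer == gold:
            correct += 1
        else:
            failures.append(f"{question!r}: expected {gold!r}, got {answer!r}")
    accuracy = correct / len(opt_set)
    gap = abs(sum(confidences) / len(confidences) - accuracy)
    return DiagnosticReport(prompt, accuracy, gap, failures)


def reflective_prompt_tuning(initial_prompt, target, revise, opt_set,
                             iterations=3, calib_weight=0.5):
    """Run the loop; `revise` stands in for the LLM optimizer (an assumption)."""
    memory: List[DiagnosticReport] = []  # accumulated reports from prior iterations
    prompt = initial_prompt
    for _ in range(iterations):
        report = diagnose(prompt, target, opt_set)
        memory.append(report)
        prompt = revise(prompt, report, memory)
    memory.append(diagnose(prompt, target, opt_set))  # score the last revision too
    # Assumed selection rule: trade accuracy against the calibration gap.
    best = max(memory, key=lambda r: r.accuracy - calib_weight * r.calibration_gap)
    return best, memory


# Toy stand-ins so the sketch runs without any model API.
def toy_target(prompt: str, question: str) -> Tuple[str, float]:
    a, b = map(int, question.split("+"))
    if "step by step" in prompt:
        return str(a + b), 0.9
    return str(a), 0.9  # systematic failure: the second operand is ignored


def toy_revise(prompt: str, report: DiagnosticReport, memory: List[DiagnosticReport]) -> str:
    if report.failures and "step by step" not in prompt:
        return prompt + " Think step by step."
    return prompt


opt_set = [("1+2", "3"), ("2+0", "2"), ("3+4", "7")]
best, history = reflective_prompt_tuning("Add the numbers.", toy_target, toy_revise, opt_set)
print(best.prompt, best.accuracy, round(best.calibration_gap, 3))
```

In this toy run the first report exposes a recurring failure that a single example could miss, and the selection step prefers the revised prompt because it is both more accurate and closer to its stated confidence.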