FAPO: Fully Autonomous Prompt Optimization for Multi-Step LLM Pipelines

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks elsewhere in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and repeatedly validates variants to optimize against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the GEPA baseline in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 percentage points (pp). In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security task that maps CVEs to CWEs, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
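The prompt-first escalation rule described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not FAPO's actual implementation: the function `prompt_first_search`, its callable parameters, the dictionary representation of a pipeline, and the `prompt_rounds` and `target` settings are all hypothetical names introduced here, since the abstract does not specify any interface.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: the source does not define a pipeline type.
Pipeline = dict


def prompt_first_search(
    pipeline: Pipeline,
    score: Callable[[Pipeline], float],
    propose_prompt_edit: Callable[[Pipeline], Pipeline],
    propose_structure_edit: Callable[[Pipeline], Optional[Pipeline]],
    is_structural_bottleneck: Callable[[Pipeline], bool],
    prompt_rounds: int = 3,   # assumed budget for prompt-only edits
    target: float = 1.0,      # assumed score at which prompts count as sufficient
) -> Tuple[Pipeline, float, List[str]]:
    """Try prompt edits first; escalate to one structural edit only if needed."""
    best, best_score = pipeline, score(pipeline)
    accepted: List[str] = []

    # Phase 1: prompt-only edits. A candidate is kept only if it scores higher.
    for _ in range(prompt_rounds):
        candidate = propose_prompt_edit(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
            accepted.append("prompt")

    # Phase 2: escalate only when prompt edits fell short of the target AND
    # attribution points at the chain structure as the bottleneck.
    if best_score < target and is_structural_bottleneck(best):
        candidate = propose_structure_edit(best)
        if candidate is not None:
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
                accepted.append("structure")

    return best, best_score, accepted
```

In this sketch both conditions must hold before the structure is touched, which mirrors the "only when prompt optimization appears insufficient and attribution identifies a structural bottleneck" wording. A real system would presumably validate each variant over repeated trials rather than with a single call to `score`.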