FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

June 17, 2026
Authors: Paul Kassianik, Baturay Saglam, Huaibo Zhao, Blaine Nelson, Supriti Vijay, Aman Priyanshu, Amin Karbasi
cs.AI

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and validates variants repeatedly to optimize against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the GEPA baseline in 15 of 18 model-benchmark comparisons. In 11 of these comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean gain of FAPO over GEPA is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security CVE-to-CWE mapping task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
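
The prompt-first escalation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Pipeline` class, the `optimize` and `best_variant` functions, the proposal callables, and the `target` threshold (standing in for "prompt optimization appears insufficient") are all hypothetical names and assumptions introduced here for illustration. The sketch assumes a greedy search that keeps a variant only when it scores strictly higher.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pipeline:
    """Hypothetical stand-in for a multi-step LLM pipeline."""
    prompts: List[str]        # one prompt per step
    structure: str = "base"   # illustrative label for the chain layout


ScoreFn = Callable[[Pipeline], float]
ProposeFn = Callable[[Pipeline], List[Pipeline]]


def best_variant(current: Pipeline, propose: ProposeFn, score: ScoreFn,
                 rounds: int) -> Tuple[Pipeline, float]:
    """Greedy search: keep a proposed variant only if it scores higher."""
    best, best_score = current, score(current)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break  # no proposal helped; stop early
    return best, best_score


def optimize(pipeline: Pipeline, score: ScoreFn,
             propose_prompt_edits: ProposeFn,
             propose_structure_edits: ProposeFn,
             is_structural_bottleneck: Callable[[Pipeline], bool],
             target: float, rounds: int = 5) -> Tuple[Pipeline, float]:
    """Prompt-first search that escalates to structural edits only when needed."""
    # Phase 1: prompt edits only.
    best, best_score = best_variant(pipeline, propose_prompt_edits, score, rounds)
    # Phase 2 (assumed trigger): the score is still below the target AND
    # attribution flags a structural bottleneck.
    if best_score < target and is_structural_bottleneck(best):
        best, best_score = best_variant(best, propose_structure_edits, score, rounds)
    return best, best_score
```

In this sketch the structural phase starts from the best prompt-only variant, so prompt gains are retained when the search escalates. How FAPO actually performs attribution and decides that prompt edits are insufficient is not specified in the abstract.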