FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and repeatedly validates variants to optimize against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the GEPA baseline in 15 of 18 model-benchmark comparisons. In 11 of these comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 percentage points (pp). In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a CVE-to-CWE security task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
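The non-overlapping-range criterion mentioned above can be illustrated with a minimal sketch. It assumes that each method has a list of per-trial scores and that the sample standard deviation is used; the abstract does not say whether the sample or population form is meant. The function names and the example scores are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def score_range(trial_scores):
    """Return (mean - sd, mean + sd) for a list of per-trial scores.

    Assumption: sample standard deviation; a single trial gets sd = 0.
    """
    m = mean(trial_scores)
    sd = stdev(trial_scores) if len(trial_scores) > 1 else 0.0
    return m - sd, m + sd


def wins_with_non_overlapping_ranges(candidate_scores, baseline_scores):
    """True if the candidate's whole range lies above the baseline's range."""
    cand_low, _ = score_range(candidate_scores)
    _, base_high = score_range(baseline_scores)
    return cand_low > base_high


if __name__ == "__main__":
    # Hypothetical trial scores, for illustration only.
    print(wins_with_non_overlapping_ranges([70, 72, 74], [60, 61, 62]))
    print(wins_with_non_overlapping_ranges([70, 72, 74], [68, 70, 72]))
```

Under this reading, a comparison counts as a non-overlapping win only when the lower end of the candidate's range exceeds the upper end of the baseline's range, which is a stricter condition than a higher mean alone.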