FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that enables Claude Code to optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and repeatedly validates variants to optimize against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the GEPA baseline in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 percentage points (pp). In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a security task that maps CVEs to CWEs, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
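
The separation criterion behind the "non-overlapping ranges" count can be illustrated with a minimal sketch. It assumes that each method's range is the interval from mean minus standard deviation to mean plus standard deviation over its per-trial scores, and that the sample standard deviation is used; the abstract does not specify either detail. The function names `band` and `wins_with_separation` and all scores below are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev


def band(scores):
    """Return (mean - sd, mean + sd) for a list of per-trial scores.

    Assumption: sd is the sample standard deviation across trials.
    """
    m = mean(scores)
    sd = stdev(scores)
    return m - sd, m + sd


def wins_with_separation(candidate, baseline):
    """True if the candidate's band lies strictly above the baseline's band."""
    cand_lo, _ = band(candidate)
    _, base_hi = band(baseline)
    return cand_lo > base_hi


# Made-up per-trial accuracies in percent, for illustration only.
fapo_trials = [71.0, 73.0, 72.0]
gepa_trials = [60.0, 62.0, 61.0]
noisy_trials = [70.0, 74.0, 69.0]

print(wins_with_separation(fapo_trials, gepa_trials))   # bands (71, 73) vs (60, 62)
print(wins_with_separation(fapo_trials, noisy_trials))  # bands overlap
```

Under this reading, a comparison counts as a separated win only when the lower end of one method's band exceeds the upper end of the other's, which is a stricter condition than a higher mean alone.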