FAPO: Fully Autonomous Prompt Optimization for Multi-Step LLM Pipelines

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO optimizes against a score function by evaluating the pipeline, inspecting intermediate steps, diagnosing failures, proposing scoped changes, and validating variants in repeated iterations. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the GEPA baseline in 15 of 18 model-benchmark comparisons. In 11 model-benchmark comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean gain of FAPO over GEPA is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a CVE-to-CWE security task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
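The non-overlap criterion used for the 11 comparisons can be illustrated with a minimal sketch. It assumes each method has several per-trial scores, that the range is the mean plus or minus one sample standard deviation, and that a win means the candidate's whole range lies above the baseline's. The function names and the scores in the usage lines are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def score_range(scores):
    """Return (mean - sd, mean + sd) for a list of per-trial scores.

    Assumption: sd is the sample standard deviation across trials.
    """
    m = mean(scores)
    sd = stdev(scores)
    return m - sd, m + sd


def wins_without_overlap(candidate_scores, baseline_scores):
    """True if the candidate's range lies entirely above the baseline's."""
    candidate_low, _ = score_range(candidate_scores)
    _, baseline_high = score_range(baseline_scores)
    return candidate_low > baseline_high


# Illustrative, made-up trial scores (not results from the paper).
clear_win = wins_without_overlap([80, 82, 84], [60, 62, 64])
overlapping = wins_without_overlap([70, 72, 74], [69, 71, 73])
```

Under these assumptions the first pair counts as a non-overlapping win, while the second pair has a higher mean but overlapping ranges, so it would not be counted among the 11.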