FAPO: Fully Autonomous Prompt Optimization for Multi-Step LLM Pipelines

Abstract

Multi-step LLM pipelines fail through interactions among retrieval, reasoning, and formatting steps, so prompt-only optimization can miss bottlenecks in the chain. We present FAPO (Fully Autonomous Prompt Optimization), a framework that lets Claude Code optimize an LLM pipeline inside a standardized codebase. FAPO evaluates a pipeline, inspects intermediate steps, diagnoses failures, proposes scoped changes, and iteratively validates variants to optimize against a score function. It tries prompt edits first and changes the chain structure, within the permitted scope, only when prompt optimization appears insufficient and attribution identifies a structural bottleneck. Across six benchmarks and three task models, FAPO beats the baseline GEPA in 15 of 18 model-benchmark comparisons. In 11 of these comparisons, FAPO wins with non-overlapping mean ± trial-standard-deviation ranges, and the mean FAPO-GEPA gain is +14.1 pp. In the six HoVer and IFBench comparisons where prompt-first search escalated to structural changes, FAPO wins all six, with a mean gain of +33.8 pp. FAPO also improves performance on security tasks: on CTIBench-RCM, a CVE-to-CWE mapping task, prompt-only FAPO lifts test accuracy by +4.0 pp on GPT-5, +7.1 pp on Foundation-Sec-8B-Instruct, and +2.0 pp on Foundation-Sec-8B-Reasoning. These results position FAPO as a state-of-the-art pipeline optimization technique for both general-purpose and security-focused tasks.
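
The prompt-first policy with conditional structural escalation can be illustrated with a minimal Python sketch. This is not FAPO's implementation: the `Variant` type, the `optimize` function, the greedy acceptance rule, and the toy scoring and proposal functions are all hypothetical stand-ins chosen for illustration. The sketch only shows the control flow described above, in which structural edits are considered when prompt edits stop improving the score and an attribution check flags a structural bottleneck.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Variant:
    """Hypothetical pipeline variant: one prompt plus a chain-structure label."""
    prompt: str
    structure: str


def optimize(
    start: Variant,
    score: Callable[[Variant], float],
    propose_prompt_edits: Callable[[Variant], List[Variant]],
    propose_structure_edits: Callable[[Variant], List[Variant]],
    is_structural_bottleneck: Callable[[Variant], bool],
    max_rounds: int = 5,
) -> Variant:
    """Greedy prompt-first search that escalates to structural edits on a stall."""
    best, best_score = start, score(start)
    for _ in range(max_rounds):
        improved = False
        # Step 1: try prompt-only edits and keep any that raise the score.
        for cand in propose_prompt_edits(best):
            s = score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
        if improved:
            continue
        # Step 2: escalate only if prompt edits stalled AND attribution
        # (here a placeholder predicate) points at the chain structure.
        if not is_structural_bottleneck(best):
            break
        for cand in propose_structure_edits(best):
            s = score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
        if not improved:
            break
    return best


# Toy stand-ins (assumptions for illustration, not real benchmark behavior).
def toy_score(v: Variant) -> float:
    s = 0.3
    if "cite evidence" in v.prompt:
        s += 0.2
    if v.structure == "multi-hop":
        s += 0.4
    return s


def toy_prompt_edits(v: Variant) -> List[Variant]:
    if "cite evidence" in v.prompt:
        return []
    return [Variant(v.prompt + " cite evidence", v.structure)]


def toy_structure_edits(v: Variant) -> List[Variant]:
    if v.structure == "multi-hop":
        return []
    return [Variant(v.prompt, "multi-hop")]


def toy_bottleneck(v: Variant) -> bool:
    return v.structure == "single-hop"


result = optimize(
    Variant("Answer the claim.", "single-hop"),
    toy_score,
    toy_prompt_edits,
    toy_structure_edits,
    toy_bottleneck,
)
print(result)
```

In this toy run the search first accepts the prompt edit, then stalls on prompt edits, and only then accepts the structural change, which mirrors the escalation order stated in the abstract. A real system would replace the toy functions with pipeline evaluation on held-out data and with model-driven diagnosis.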