MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Abstract

LLM agents organize behavior through skills: structured natural-language specifications that govern how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for a limited context window. These constraints make skill optimization inherently multi-objective, since a skill must maximize task performance while also satisfying platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, which misses Pareto-optimal variants in non-convex regions of the objective space. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization, covering the full Pareto front including its non-convex regions, and combines it with exponential annealing that shifts the search from exploration to exploitation. In experiments across six diverse agent skills, where all methods share the same multi-objective mutation operator and the baselines receive identical per-objective textual feedback, existing optimizers fail to improve on the seed skill for 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving a 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering more than twice as many Pareto-optimal skill variants.
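The abstract does not give formulas, so the following is only a minimal sketch of the two ingredients it names, under stated assumptions. It assumes the textbook weighted Chebyshev form, g(x) = max_i w_i (z*_i - f_i(x)), with ideal point z* and objectives to be maximized, and it assumes that annealing acts as an exponentially decaying softmax temperature over those scores. The function names, the toy objective values, and the decay rate are all hypothetical and are not taken from the paper.

```python
import math
import random


def chebyshev_score(objectives, weights, ideal):
    """Weighted Chebyshev distance to the ideal point (lower is better).

    Assumes every objective is maximized, so (ideal - value) >= 0.
    """
    return max(w * (z - f) for f, w, z in zip(objectives, weights, ideal))


def weighted_sum(objectives, weights):
    """Linear scalarization (higher is better), shown for comparison."""
    return sum(w * f for f, w in zip(objectives, weights))


def temperature(step, t0=1.0, decay=0.01):
    """Exponentially decaying temperature; t0 and decay are assumed values."""
    return t0 * math.exp(-decay * step)


def select(candidates, weights, ideal, step, rng):
    """Sample a candidate index with a softmax over Chebyshev scores.

    A high temperature gives near-uniform sampling (exploration); a low
    temperature concentrates on the best-scoring candidate (exploitation).
    """
    temp = max(temperature(step), 1e-9)
    scores = [chebyshev_score(c, weights, ideal) for c in candidates]
    best = min(scores)
    probs = [math.exp(-(s - best) / temp) for s in scores]
    return rng.choices(range(len(candidates)), weights=probs, k=1)[0]


# Toy (correctness, compactness) pairs, both maximized. Candidate 1 is
# Pareto-optimal but lies in a non-convex dent of the front.
CANDIDATES = [(1.0, 0.0), (0.4, 0.4), (0.0, 1.0)]
IDEAL = (1.0, 1.0)

if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 200, 2000):
        picked = select(CANDIDATES, (0.5, 0.5), IDEAL, step, rng)
        print(step, temperature(step), picked)
```

In this toy setup, no choice of weights lets the weighted sum rank candidate 1 first, because one of the two extreme candidates always scores at least 0.5 against its 0.4. The Chebyshev score with equal weights prefers it, which illustrates the kind of non-convex-region variant that the abstract says a weighted sum misses.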