MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

Abstract

LLM agents organize behavior through skills: structured natural-language specifications governing how an agent reasons, retrieves, and responds. Unlike monolithic prompts, skills are multi-field artifacts subject to hard platform constraints: description fields are truncated for routing, instruction bodies are compacted via progressive disclosure, and co-resident skills compete for limited context windows. These constraints make skill optimization inherently multi-objective: a skill must simultaneously maximize task performance and satisfy platform limits. Yet existing prompt optimizers either ignore these trade-offs or collapse them into a weighted sum, which misses Pareto-optimal variants in non-convex objective regions. We introduce MOCHA (Multi-Objective Chebyshev Annealing), which replaces single-objective selection with Chebyshev scalarization (covering the full Pareto front, including non-convex regions) combined with exponential annealing that transitions from exploration to exploitation. In our experiments across six diverse agent skills, where all methods share the same multi-objective mutation operator and baselines receive identical per-objective textual feedback, existing optimizers fail to improve the seed skill on 4 of 6 tasks: 1000 rollouts yield zero progress. MOCHA breaks through on every task, achieving a 7.5% relative improvement in mean correctness over the strongest baseline (up to 14.9% on FEVER and 10.4% on TheoremQA) while discovering more than twice as many Pareto-optimal skill variants.
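To make the contrast between weighted-sum and Chebyshev scalarization concrete, the following is a minimal sketch and not the paper's implementation. It uses the textbook Chebyshev form, max_i w_i |z*_i - f_i|, on three hypothetical skill variants whose two objectives are both maximized. The candidate values, the ideal point, the function names, the softmax selection rule, and the schedule T(t) = T0 * exp(-rate * t) are all assumptions chosen for illustration; the abstract does not specify what quantity MOCHA anneals.

```python
import math

def weighted_sum(objs, weights):
    # Linear scalarization: higher is better.
    return sum(w * f for w, f in zip(weights, objs))

def chebyshev(objs, weights, ideal):
    # Chebyshev scalarization: worst weighted gap to the ideal point; lower is better.
    return max(w * abs(z - f) for w, f, z in zip(weights, objs, ideal))

def temperature(step, t0=1.0, rate=0.01):
    # Assumed exponential annealing schedule (hypothetical parameters).
    return t0 * math.exp(-rate * step)

def selection_probs(scores, temp):
    # Softmax over negated scores: high temp explores, low temp exploits.
    best = min(scores)
    exps = [math.exp(-(s - best) / temp) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical variants as (objective 1, objective 2), both maximized.
# C is Pareto-optimal but sits in a non-convex dent of the front.
candidates = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (0.4, 0.4)}
ideal = (1.0, 1.0)
weight_grid = [(i / 10, 1 - i / 10) for i in range(11)]

ws_winners = {
    max(candidates, key=lambda k: weighted_sum(candidates[k], w))
    for w in weight_grid
}
ch_winners = {
    min(candidates, key=lambda k: chebyshev(candidates[k], w, ideal))
    for w in weight_grid
}

print(sorted(ws_winners))  # ['A', 'B']
print(sorted(ch_winners))  # ['A', 'B', 'C']
```

No weight vector lets the weighted sum select C, because its best achievable score (0.4) is always below max(w1, w2), which is at least 0.5. Chebyshev scalarization recovers C for balanced weights. Under the assumed schedule, `selection_probs` spreads probability across variants early on and concentrates on the lowest-scoring (best) variant as the temperature decays.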