SkillOpt: An Execution Strategy for Self-Evolving Agent Skills

Abstract

Agent skills today are hand-crafted, generated in one shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer for the skill, and none reliably improves on its starting point under feedback. We argue that the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic, controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow/meta update stabilize skill training while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied for best on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5, it raises average accuracy over the no-skill baseline by +23.5 points in direct chat, by +24.8 points inside the Codex agentic loop, and by +19.1 points inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
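
The acceptance rule described above (bounded edits, kept only on a strict held-out improvement, with rejected edits remembered) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `Edit`, `apply_edit`, `train_skill`, `propose`, `validate`, and `max_edits_per_step` are hypothetical names introduced here. `propose` stands in for the separate optimizer model that reads scored rollouts, `validate` stands in for the held-out validation score, and the epoch-wise slow/meta update is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded edit on the skill document (hypothetical representation)."""
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line content for "add" and "replace"


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with one edit applied; the input is not mutated."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op!r}")
    return lines


def train_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], List[Edit]],
    validate: Callable[[List[str]], float],
    steps: int,
    max_edits_per_step: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Keep a candidate skill only if it strictly improves the validation score."""
    best = list(skill)
    best_score = validate(best)
    rejected: List[Edit] = []  # buffer of edits that failed, shown to `propose`
    for _ in range(steps):
        # Assumed form of the "learning-rate budget": cap the edits per step.
        edits = propose(best, rejected)[:max_edits_per_step]
        candidate = best
        for edit in edits:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:      # strict improvement; ties are rejected
            best, best_score = candidate, score
        else:
            rejected.extend(edits)  # remember what did not help
    return best, best_score, rejected


def toy_run() -> Tuple[List[str], float, List[Edit]]:
    """Toy stand-ins: the score counts how many useful lines the skill contains."""
    useful = {"check units", "show work"}
    pool = ["be verbose", "check units", "show work"]

    def validate(skill: List[str]) -> float:
        return float(len(useful & set(skill)))

    def propose(skill: List[str], rejected: List[Edit]) -> List[Edit]:
        tried = {e.text for e in rejected}
        for line in pool:
            if line not in skill and line not in tried:
                return [Edit("add", len(skill), line)]
        return []

    return train_skill(["solve the problem"], propose, validate,
                       steps=5, max_edits_per_step=1)
```

In the toy run, the unhelpful line is proposed once, fails the strict-improvement check, and lands in the rejected buffer, while the two useful lines are accepted in turn. Because acceptance depends only on the validation score, the trained document is a static artifact and needs no extra model calls when it is used.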