SkillOpt: Executive Strategy for Self-Evolving Agent Skills

Abstract

Agent skills today are hand-crafted, generated in one shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer for the skill, and none reliably improves over its starting point under feedback. We argue that the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic, controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5, it lifts average accuracy over the no-skill baseline by 23.5 points in direct chat, by 24.8 points inside the Codex agentic loop, and by 19.1 points inside Claude Code. Transfer experiments further show that optimized skill artifacts retain their value when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark, all without further optimization.
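The accept/reject loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Edit`, `apply_edit`, `train_skill`, `propose`, `validate`, and `edit_budget` are hypothetical, the skill document is assumed to be a list of text lines, `propose` stands in for the separate optimizer model, and `validate` stands in for scoring on the held-out validation set. The epoch-wise slow/meta update is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded edit on the skill document (hypothetical representation)."""
    op: str          # "add", "delete", or "replace"
    index: int       # line position the edit targets
    text: str = ""   # new text for "add" / "replace"


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with one edit applied; the input is not mutated."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(min(edit.index, len(lines)), edit.text)
    elif edit.op == "delete":
        if 0 <= edit.index < len(lines):
            del lines[edit.index]
    elif edit.op == "replace":
        if 0 <= edit.index < len(lines):
            lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return lines


def train_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], List[Edit]],  # assumed optimizer-model stand-in
    validate: Callable[[List[str]], float],                  # assumed held-out scorer
    edit_budget: int,                                        # assumed "textual learning rate"
    steps: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Accept a batch of edits only if it strictly improves the validation score."""
    best_score = validate(skill)
    rejected: List[Edit] = []  # rejected-edit buffer, shown to the proposer
    for _ in range(steps):
        edits = propose(skill, rejected)[:edit_budget]  # cap edits per step
        if not edits:
            break
        candidate = skill
        for edit in edits:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        if score > best_score:           # strict improvement required
            skill, best_score = candidate, score
        else:
            rejected.extend(edits)       # remember what did not help
    return skill, best_score, rejected


# Toy stand-ins, for illustration only.
def toy_validate(skill: List[str]) -> float:
    required = {"check units", "show work"}
    return len(required & set(skill)) / len(required)


def toy_propose(skill: List[str], rejected: List[Edit]) -> List[Edit]:
    rejected_texts = {e.text for e in rejected}
    ideas = ["be verbose", "check units", "show work"]
    return [
        Edit("add", len(skill), t)
        for t in ideas
        if t not in rejected_texts and t not in skill
    ]


final_skill, final_score, buffer = train_skill(
    ["read the problem"], toy_propose, toy_validate, edit_budget=1, steps=4
)
```

In this toy run the unhelpful line "be verbose" is proposed first, fails to raise the score, and lands in the rejected-edit buffer, while the two lines that do raise the score are accepted one at a time. The trained artifact is plain text, so using it at deployment requires no additional model call.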