SkillOpt: An Execution Strategy for Self-Evolving Agent Skills

Abstract

Agent skills today are hand-crafted, generated in one shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer for the skill, and none reliably improves on its starting point under feedback. We argue that the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic, controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow/meta update keep skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied for best on all 52 evaluated (model, benchmark, harness) cells, matching or beating every per-cell competitor among human-written, one-shot LLM-generated, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5, it raises average accuracy over the no-skill baseline by 23.5 points in direct chat, by 24.8 points inside the Codex agentic loop, and by 19.1 points inside Claude Code. Transfer experiments further show that optimized skill artifacts retain their value, without further optimization, when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark.
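The accept/reject loop described in the abstract can be illustrated with a minimal sketch. It is not the SkillOpt implementation. The names `Edit`, `apply_edit`, `edit_cost`, and `train_skill` are hypothetical, the optimizer model is stood in for by a fixed queue of proposals, the validation score is a toy function, and the "textual learning-rate budget" is assumed here to be a cap on the number of characters a single edit may change. The epoch-wise slow/meta update is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """Hypothetical edit record: one bounded change to the skill document."""
    op: str          # "add", "delete", or "replace"
    index: int       # line position the edit targets
    text: str = ""   # new text for "add" and "replace"


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill document with the edit applied (input is not mutated)."""
    lines = list(skill)
    in_range = 0 <= edit.index < len(lines)
    if edit.op == "add":
        lines.insert(max(0, min(edit.index, len(lines))), edit.text)
    elif edit.op == "delete":
        if in_range:
            del lines[edit.index]
    elif edit.op == "replace":
        if in_range:
            lines[edit.index] = edit.text
    else:
        raise ValueError(f"unknown edit op: {edit.op}")
    return lines


def edit_cost(skill: List[str], edit: Edit) -> int:
    """Assumed cost model: number of characters the edit adds or overwrites."""
    old = skill[edit.index] if 0 <= edit.index < len(skill) else ""
    if edit.op == "add":
        return len(edit.text)
    if edit.op == "delete":
        return len(old)
    return max(len(old), len(edit.text))


def train_skill(
    skill: List[str],
    propose: Callable[[List[str], List[Edit]], Optional[Edit]],
    validate: Callable[[List[str]], float],
    steps: int,
    budget: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Keep an edit only if it strictly improves the held-out validation score."""
    best_score = validate(skill)
    rejected: List[Edit] = []  # buffer shown to the proposer on later steps
    for _ in range(steps):
        edit = propose(skill, rejected)
        if edit is None:
            break
        if edit in rejected:
            continue  # do not spend a validation call on a known failure
        if edit_cost(skill, edit) > budget:
            rejected.append(edit)  # over the assumed per-edit budget
            continue
        candidate = apply_edit(skill, edit)
        score = validate(candidate)
        if score > best_score:  # strict improvement; ties are rejected
            skill, best_score = candidate, score
        else:
            rejected.append(edit)
    return skill, best_score, rejected


# Toy usage: the score counts lines that belong to a set of "useful" rules.
USEFUL = {"Read the task carefully.", "Check units before answering."}
proposals = iter([
    Edit("add", 1, "Check units before answering."),     # improves the score
    Edit("add", 1, "x" * 500),                           # exceeds the budget
    Edit("delete", 0),                                   # lowers the score
    Edit("replace", 0, "Read the task carefully."),      # tie, so rejected
])
final_skill, final_score, rejected_edits = train_skill(
    skill=["Read the task carefully."],
    propose=lambda skill, rejected: next(proposals, None),
    validate=lambda skill: float(sum(line in USEFUL for line in skill)),
    steps=10,
    budget=80,
)
print(final_skill, final_score, len(rejected_edits))
```

In this toy run only the first proposal is kept. The over-budget edit, the harmful deletion, and the no-op replacement all land in the rejected buffer. Because acceptance requires a strict improvement, the validation score of the stored skill never decreases, and the deployed artifact is just the final document, so no optimizer calls are needed at inference time.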