SkillOpt: Execution Strategy for Self-Evolving Agent Skills

Abstract

Agent skills today are hand-crafted, generated in one shot, or evolved through loosely controlled self-revision. None of these approaches behaves like a deep-learning optimizer for the skill, and none reliably improves over its starting point under feedback. We argue that the skill should instead be trained as the external state of a frozen agent, with the same discipline that makes weight-space optimization reproducible. SkillOpt is, to our knowledge, the first systematic, controllable text-space optimizer for agent skills: a separate optimizer model turns scored rollouts into bounded add/delete/replace edits on a single skill document, and an edit is accepted only when it strictly improves a held-out validation score. A textual learning-rate budget, a rejected-edit buffer, and an epoch-wise slow/meta update make skill training stable while adding zero inference-time model calls at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated (model, benchmark, harness) cells and beats every per-cell competitor among human, one-shot LLM, Trace2Skill, TextGrad, GEPA, and EvoSkill skills. On GPT-5.5, it improves on the no-skill baseline accuracy by an average of +23.5 points in direct chat, +24.8 points inside the Codex agentic loop, and +19.1 points inside Claude Code. Transfer experiments further show that optimized skill artifacts retain value when moved across model scales, between the Codex and Claude Code execution environments, and to a nearby math benchmark without further optimization.
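
The acceptance rule described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name here (`Edit`, `apply_batch`, `train_skill`, `toy_propose`, `toy_val_score`) is hypothetical. The sketch assumes that the textual learning-rate budget caps the number of edits applied per step and that the rejected-edit buffer is a set of previously rejected edit batches; the abstract does not specify either detail. The epoch-wise slow/meta update is omitted, and the optimizer model and held-out evaluation are replaced by deterministic toy functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line position the edit targets
    text: str = ""  # new line text; unused for "delete"


Batch = Tuple[Edit, ...]


def apply_batch(skill: List[str], batch: Batch) -> List[str]:
    """Return a copy of the skill with the edits applied in order."""
    out = list(skill)
    for e in batch:
        if e.op == "add":
            out.insert(e.index, e.text)
        elif e.op == "delete":
            del out[e.index]
        elif e.op == "replace":
            out[e.index] = e.text
        else:
            raise ValueError(f"unknown edit op: {e.op}")
    return out


def train_skill(
    skill: List[str],
    propose: Callable[[List[str], Set[Batch]], Batch],
    val_score: Callable[[List[str]], float],
    steps: int,
    budget: int,
) -> Tuple[List[str], float, Set[Batch]]:
    """Keep a proposed batch only if it strictly improves the validation score."""
    best = val_score(skill)
    rejected: Set[Batch] = set()  # assumed form of the rejected-edit buffer
    for _ in range(steps):
        # Assumed form of the budget: at most `budget` edits per step.
        batch = tuple(propose(skill, rejected))[:budget]
        if not batch:
            break  # the proposer has nothing new to offer
        candidate = apply_batch(skill, batch)
        score = val_score(candidate)
        if score > best:  # strict improvement; ties are rejected
            skill, best = candidate, score
        else:
            rejected.add(batch)
    return skill, best, rejected


# Toy stand-ins: a real setup would call an optimizer model on scored
# rollouts and run the frozen agent on held-out tasks.
USEFUL = {"Show your work.", "Verify the final answer."}


def toy_val_score(skill: List[str]) -> float:
    return float(sum(line in USEFUL for line in skill))


def toy_propose(skill: List[str], rejected: Set[Batch]) -> Batch:
    pool = [
        (Edit("add", len(skill), "Show your work."),),
        (Edit("delete", 0),),
        (Edit("add", len(skill), "Verify the final answer."),),
    ]
    for batch in pool:
        if batch in rejected:
            continue  # do not re-propose a rejected batch
        if any(e.op == "add" and e.text in skill for e in batch):
            continue  # the line is already in the skill
        return batch
    return ()


final_skill, final_score, rejected = train_skill(
    ["Answer quickly."], toy_propose, toy_val_score, steps=10, budget=2
)
```

In this toy run the delete edit leaves the validation score unchanged, so it is rejected and stored in the buffer rather than accepted on a tie. The returned skill is plain text, which is consistent with the abstract's point that deployment needs no extra model calls.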