SkillGrad: Optimizing Agent Skills Like Gradient Descent

Abstract

Agent skills offer a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods typically address these deficiencies through heuristic reflection, without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to be optimized in the manner of gradient descent: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the direction of correction. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher performs the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill-evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to final skill quality.
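
As a rough illustration of the loop described in the abstract (execute, diagnose, accumulate momentum, patch), the following is a minimal Python sketch and not the authors' implementation. All names (`SkillPackage`, `Trajectory`, `Momentum`, `skillgrad_loop`, the two layer names, and the toy tasks) are hypothetical. The LLM-based executor, diagnoser, and patcher are replaced by deterministic stubs, and the exponential-decay rule with a fixed threshold is an assumed stand-in for the momentum agent.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SkillPackage:
    # Hypothetical two-layer layout: core instructions plus a memory overlay.
    instructions: list[str] = field(default_factory=list)
    memory_overlay: list[str] = field(default_factory=list)


@dataclass
class Trajectory:
    task: str
    success: bool  # stands in for trajectory-level loss evidence


class Momentum:
    """Assumed rule: exponentially decayed counts of repeated diagnoses."""

    def __init__(self, beta: float = 0.5, threshold: float = 1.2):
        self.beta = beta
        self.threshold = threshold
        self.scores: dict[str, float] = {}

    def update(self, diagnoses: list[str]) -> list[str]:
        counts = Counter(diagnoses)
        keys = set(self.scores) | set(counts)
        self.scores = {
            k: self.beta * self.scores.get(k, 0.0) + counts.get(k, 0) for k in keys
        }
        # Diagnoses whose accumulated score passes the threshold count as recurring.
        return sorted(k for k, v in self.scores.items() if v >= self.threshold)


def skillgrad_loop(
    skill: SkillPackage,
    tasks: list[str],
    execute: Callable[[SkillPackage, str], Trajectory],
    diagnose: Callable[[SkillPackage, Trajectory], str],
    steps: int = 3,
) -> SkillPackage:
    momentum = Momentum()
    for _ in range(steps):
        trajectories = [execute(skill, t) for t in tasks]
        failures = [t for t in trajectories if not t.success]
        if not failures:
            break  # no loss evidence left on these tasks
        gradients = [diagnose(skill, t) for t in failures]  # textual "gradients"
        recurring = momentum.update(gradients)
        # Stub patcher: recurring diagnoses go to the overlay, the rest to instructions.
        for g in gradients:
            layer = skill.memory_overlay if g in recurring else skill.instructions
            if g not in layer:
                layer.append(g)
    return skill


# Toy stand-ins for the LLM executor and diagnoser (illustrative only).
NEEDED = {
    "sum column": "check header offset",
    "sum row": "check header offset",
    "lookup cell": "verify sheet name",
}


def toy_execute(skill: SkillPackage, task: str) -> Trajectory:
    hint = NEEDED[task]
    return Trajectory(task, hint in skill.instructions or hint in skill.memory_overlay)


def toy_diagnose(skill: SkillPackage, traj: Trajectory) -> str:
    return NEEDED[traj.task]


if __name__ == "__main__":
    print(skillgrad_loop(SkillPackage(), list(NEEDED), toy_execute, toy_diagnose))
```

In this toy run, the diagnosis shared by two failing tasks crosses the momentum threshold and is written to the overlay, while the one-off diagnosis is patched into the instruction layer. A real system would replace the string-matching stubs with model calls and a more careful patch format.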