SkillEvolBench: An Evolution Benchmark from Episodic Experience to Procedural Skills

Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or building larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.