SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating the step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families with shared latent procedures. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust reusable skills. Skill-based conditions can improve acquisition or replay performance, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or building larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage, but they also introduce episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.