From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Abstract

Language agents increasingly improve by reusing skills, which are structured procedural artifacts distilled from past experience. Domain-level and model-generated skills are especially promising: they offer fast adaptation within a domain by encoding domain-specific recurring procedures, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding remains limited. No comprehensive study spans the full skill lifecycle (experience generation, skill extraction, and skill consumption) to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor target agents behave uniformly: a model can be a strong extractor yet a weak consumer, or vice versa, and skill utility is independent of both model scale and baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility. This meta-skill consistently improves skill quality across domains and substantially reduces negative transfer.