From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Abstract

Language agents increasingly improve by reusing skills: structured procedural artifacts distilled from past experience. Domain-level, model-generated skills are especially promising because they encode a domain's recurring procedures, enabling fast adaptation within that domain, and they scale beyond labor-intensive hand-crafting. However, while extraction methods continue to proliferate, understanding of them remains limited. No comprehensive study spans the full skill lifecycle (experience generation, skill extraction, and skill consumption) to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents in five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor targets behave uniformly: a model can be a strong extractor yet a weak consumer, or vice versa, and skill utility is independent of model scale and baseline task strength. To explain these patterns, we dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility; it consistently improves skill quality across domains and substantially reduces negative transfer.