From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

Abstract

Language agents increasingly improve by reusing skills, which are structured procedural artifacts distilled from past experience. Domain-level and model-generated skills are especially promising: they enable fast adaptation within a domain by encoding its recurring procedures, and they scale beyond labor-intensive hand-crafting. Yet while extraction methods continue to proliferate, understanding remains limited. No comprehensive study spans the full skill lifecycle (experience generation, skill extraction, and skill consumption) to ask whether such skills actually work, when they work, and what makes them succeed or fail. To close this gap, we build a utility-grounded evaluation framework that provides systematic experimental results across extractors and target agents, covering five diverse agentic task domains. We find that model-generated skills are beneficial on average but exhibit non-trivial negative transfer, and that neither extractors nor target agents behave uniformly. A model can be a strong extractor yet a weak consumer, or vice versa, and skill utility is independent of model scale and baseline task strength. To explain these patterns, we then dissect each lifecycle stage in depth, analyzing how experience composition shapes skill quality, what properties characterize useful skills, and how the same skill transfers across different consumers. Finally, we translate these findings into a concrete meta-skill that guides skill extraction toward the features tied to actual utility. This meta-skill consistently improves skill quality across domains and substantially reduces negative transfer.
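
To make the notions of skill utility and negative transfer concrete, the following is a minimal sketch of how they could be computed for each (extractor, target) pair. It is an illustration under stated assumptions, not the framework's actual metric: we assume utility is the per-task score with a skill minus the score without it, and that negative transfer means a task where this difference is below zero. The function names, the data layout, and the model names `model_a` and `model_b` are all hypothetical.

```python
from statistics import mean


def skill_utility(with_skill, baseline):
    """Per-task utility: score with the skill minus the baseline score.

    Assumption: both lists are aligned by task and hold numeric scores
    (for example 1 for success, 0 for failure).
    """
    if len(with_skill) != len(baseline):
        raise ValueError("score lists must cover the same tasks")
    return [w - b for w, b in zip(with_skill, baseline)]


def summarize(results):
    """Summarize utility for each (extractor, target) pair.

    results maps (extractor, target) -> (with_skill_scores, baseline_scores).
    This layout is hypothetical and chosen only for illustration.
    """
    summary = {}
    for pair, (with_skill, baseline) in results.items():
        deltas = skill_utility(with_skill, baseline)
        summary[pair] = {
            # Average gain (or loss) from consuming the extracted skill.
            "mean_utility": mean(deltas),
            # Fraction of tasks where the skill made the agent worse.
            "negative_transfer_rate": sum(d < 0 for d in deltas) / len(deltas),
        }
    return summary


# Hypothetical scores on four tasks: skill extracted by model_a, consumed by model_b.
example = {("model_a", "model_b"): ([1, 1, 0, 1], [1, 0, 1, 0])}
report = summarize(example)
```

In this toy case the skill helps on two tasks, hurts on one, and changes nothing on the fourth, so the mean utility is positive even though negative transfer occurs. This is the same pattern the abstract describes: beneficial on average, but with a non-trivial negative tail.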