SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

Abstract

Large language model (LLM) agents accumulate rich episodic trajectories while solving real-world tasks, but it remains unclear whether such experience can be distilled into reusable procedural skills. We introduce SkillEvolBench, a diagnostic benchmark for evaluating this step from experience reuse to skill formation. It contains 180 tasks across six real-world agent environments, organized into role-conditioned task families whose tasks share a latent procedure. Agents learn from acquisition tasks, update an external skill library using compacted trajectories and verifier feedback, and then face frozen deployment tasks that test context shift, adversarial shortcuts, and composition. By comparing self-generated and curated-start skill evolution against no-skill and raw-trajectory controls, SkillEvolBench separates procedural abstraction from base capability, curated prior knowledge, and direct reuse of episodic traces. Across ten model configurations and three agent harnesses, we find that current agents often adapt locally but rarely form robust, reusable skills. Skill-based conditions can improve acquisition or replay, and individual models sometimes gain on specific deployment axes, but these gains are unstable under frozen deployment. Raw-trajectory reuse frequently outperforms distilled skills, suggesting that current abstraction procedures discard contextual and procedural cues that remain useful for future tasks. Capacity and cost analyses further show that writing more skills or larger Tier-3 resource libraries is not sufficient: additional updates can improve coverage while introducing episode-specific drift and procedural clutter. These findings position SkillEvolBench as a testbed for measuring when one-off experience becomes durable procedural knowledge rather than task-local memory.
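The acquisition-then-frozen-deployment protocol and its four comparison conditions can be illustrated with a minimal Python sketch. This is not the benchmark's implementation: every name here (`Memory`, `run_family`, `CONDITIONS`, the `solve` and `verify` callables) is hypothetical, the "distillation" step is a string placeholder standing in for whatever the agent does to rewrite its skill library, and trajectory compaction is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# Hypothetical condition labels mirroring the four setups named in the abstract.
CONDITIONS = ("no_skill", "raw_trajectory", "self_generated", "curated_start")


@dataclass
class Memory:
    """External store the agent may consult: distilled skills or raw traces."""
    skills: List[str] = field(default_factory=list)
    trajectories: List[str] = field(default_factory=list)
    frozen: bool = False

    def update(self, condition: str, trajectory: str, passed: bool) -> None:
        # Nothing is written after freezing, or in the no-skill control.
        if self.frozen or condition == "no_skill":
            return
        if condition == "raw_trajectory":
            # Control: keep the episodic trace itself, with no abstraction.
            self.trajectories.append(trajectory)
        else:
            # Placeholder for distillation (assumption): a real agent would
            # rewrite its library from the trajectory and verifier feedback.
            verdict = "pass" if passed else "fail"
            self.skills.append(f"skill distilled from <{trajectory}> ({verdict})")


def run_family(
    condition: str,
    acquisition: Sequence[str],
    deployment: Sequence[str],
    solve: Callable[[str, Memory], str],
    verify: Callable[[str, str], bool],
    curated: Sequence[str] = (),
) -> Tuple[float, Memory]:
    """Run one task family under one condition; return (deployment score, memory)."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    # Only the curated-start condition begins with a non-empty library.
    memory = Memory(skills=list(curated) if condition == "curated_start" else [])

    # Acquisition phase: solve, get verifier feedback, update the memory.
    for task in acquisition:
        trajectory = solve(task, memory)
        memory.update(condition, trajectory, verify(task, trajectory))

    # Deployment phase: the memory is frozen, so update() becomes a no-op.
    memory.frozen = True
    results = []
    for task in deployment:
        trajectory = solve(task, memory)
        results.append(verify(task, trajectory))
        memory.update(condition, trajectory, results[-1])

    score = sum(results) / len(results) if results else 0.0
    return score, memory
```

Running `run_family` once per entry in `CONDITIONS` on the same task family gives the kind of side-by-side comparison the abstract describes: differences between `self_generated` and `raw_trajectory` isolate the effect of abstraction, while `no_skill` and `curated_start` bracket base capability and curated prior knowledge.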