OpenSkillEval: Automated Auditing of the Open Skill Ecosystem for LLM Agents

Abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how skill quality should be evaluated, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills so that they can be compared in a controlled way under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly available, widely used skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional case studies and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.