OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
May 28, 2026
Authors: Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao
cs.AI
Abstract
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.