OpenSkillEval: Automatically Auditing the Open Skill Ecosystem for LLM Agents
May 28, 2026
Authors: Jiahao Ying, Boxian Ai, Wei Tang, Siyuan Liu, Yixin Cao
cs.AI
Abstract
Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. However, as the open-source skill ecosystem rapidly expands, it remains unclear how different models and agent frameworks interact with skills, how to evaluate skill quality, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It further collects and organizes community-contributed skills for controlled comparison under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that skill availability does not guarantee effective skill usage, that the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and that many publicly popular skills do not consistently outperform base agents without skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.