OpenSkillEval: Automated Auditing of the Open Skill Ecosystem for LLM Agents

Abstract

Skills, i.e., structured workflow instructions distilled for large language models (LLMs), are becoming an increasingly important mechanism for improving agent performance on real-world downstream tasks. As the open-source skill ecosystem rapidly expands, however, three questions remain open: how different models and agent frameworks interact with skills, how skill quality should be evaluated, and how users should select skills under practical cost-performance trade-offs. In this paper, we present OpenSkillEval, an automatic evaluation framework for both skill-augmented agent systems and the skills themselves. Instead of relying on static benchmarks, OpenSkillEval automatically constructs realistic task instances from evolving real-world artifacts across five categories of downstream applications: presentation generation, front-end web design, poster generation, data visualization, and report generation. It also collects and organizes community-contributed skills so that they can be compared in a controlled way under unified task settings. Using more than 600 dynamically generated task instances and 30 open-source skills, we conduct a systematic evaluation of state-of-the-art models and agent frameworks. Our results show that (1) making a skill available does not guarantee that the agent uses it effectively, (2) the benefit of skill augmentation depends strongly on both the underlying model and the agent framework, and (3) many publicly popular skills do not consistently outperform base agents that use no skills. These findings highlight the need for dynamic, task-grounded evaluation and provide practical insights into the design, selection, and deployment of skills for LLM agents. Additional cases and benchmark resources are available on the project website: https://yingjiahao14.github.io/OpenSkillEval-Web/.