How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

Abstract

Agent skills, reusable domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill usage remain scarce. Existing skill benchmarks focus on overly idealized conditions in which LLMs are directly given hand-crafted, narrowly tailored, task-specific skills for each task. In many realistic settings, however, the LLM agent may have to search for and select relevant skills on its own, and even the closest-matching skills may not fit the task well. In this paper, we conduct the first comprehensive study of skill utility under progressively more challenging realistic settings, in which agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill refinement strategies, including query-specific and query-agnostic approaches, and show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
