

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

April 6, 2026
Authors: Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, Shiyu Chang
cs.AI

Abstract

Agent skills, which are reusable, domain-specific knowledge artifacts, have become a popular mechanism for extending LLM-based agents, yet formal benchmarks of skill-usage performance remain scarce. Existing skill-benchmarking efforts focus on overly idealized conditions, where LLMs are directly provided with hand-crafted skills narrowly tailored to each task, whereas in many realistic settings the LLM agent may have to search for and select relevant skills on its own, and even the closest-matching skills may not be well tailored to the task. In this paper, we conduct the first comprehensive study of skill utility under progressively more challenging realistic settings, where agents must retrieve skills from a large collection of 34k real-world skills and may not have access to any hand-curated skills. Our findings reveal that the benefits of skills are fragile: performance gains degrade consistently as settings become more realistic, with pass rates approaching no-skill baselines in the most challenging scenarios. To narrow this gap, we study skill-refinement strategies, including query-specific and query-agnostic approaches, and show that query-specific refinement substantially recovers lost performance when the initial skills are of reasonable relevance and quality. We further demonstrate the generality of retrieval and refinement on Terminal-Bench 2.0, where they improve the pass rate of Claude Opus 4.6 from 57.7% to 65.5%. Our results, consistent across multiple models, highlight both the promise and the current limitations of skills for LLM-based agents. Our code is available at https://github.com/UCSB-NLP-Chang/Skill-Usage.
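The abstract describes agents retrieving relevant skills from a large library rather than receiving hand-curated ones. The paper's actual retrieval method is not specified here; the following is a minimal, hypothetical sketch of one common approach, ranking skill descriptions against a task query by bag-of-words cosine similarity (all names and the toy library are illustrative, not from the paper).

```python
# Hypothetical sketch of skill retrieval: rank skill descriptions against a
# task query by cosine similarity over bag-of-words term counts. This is an
# illustrative stand-in, not the paper's method; names are made up.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_skills(query: str, skill_library: dict[str, str], k: int = 3) -> list[str]:
    """Return names of the k skills whose descriptions best match the query."""
    qv = Counter(query.lower().split())
    scored = [
        (cosine(qv, Counter(desc.lower().split())), name)
        for name, desc in skill_library.items()
    ]
    scored.sort(reverse=True)  # highest-similarity skills first
    return [name for _, name in scored[:k]]

# Toy library standing in for the paper's 34k real-world skills.
library = {
    "git-bisect": "locate the commit that introduced a regression using git bisect",
    "docker-debug": "inspect and debug a failing docker container build",
    "csv-clean": "clean and normalize csv data files with pandas",
}
print(retrieve_skills("debug a docker container that fails to build", library, k=1))
```

In practice, dense embedding retrieval would likely replace the bag-of-words scoring, but the selection step, scoring every skill description against the task and keeping the top matches, has the same shape.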