

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

January 16, 2026
Authors: Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu
cs.AI

Abstract

Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks across execution frameworks, suggesting potential optimization for specific frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
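To make the automated evaluation pipeline described above more concrete, here is a minimal sketch of how a feedback-then-grade loop could be wired together: a user-simulation agent critiques the agent's deliverable for a bounded number of rounds, and the final artifact is scored against the task's rubrics inside a Docker sandbox. The class names (`agent`, `user_simulator`, `task`), the sandbox image, the `grade.py` entry point, and the round cap are illustrative assumptions, not the actual AgencyBench toolkit API; only the Docker SDK calls are real.

```python
# Hypothetical sketch of the automated evaluation loop: a user-simulation
# agent provides iterative feedback, then a Docker sandbox performs the
# rubric-based assessment. Names of agent/task objects are illustrative.
import docker  # pip install docker

MAX_FEEDBACK_ROUNDS = 3  # assumed cap on iterative feedback rounds


def run_task(agent, user_simulator, task, sandbox_image="agencybench/sandbox:latest"):
    """Run one benchmark task and return its rubric-based score in [0, 1]."""
    client = docker.from_env()

    # 1. Let the agent attempt the task, with the user simulator standing in
    #    for a human who clarifies requirements and critiques drafts.
    transcript = task.query
    deliverable = None
    for _ in range(MAX_FEEDBACK_ROUNDS):
        deliverable = agent.solve(transcript)             # may issue many tool calls
        feedback = user_simulator.review(task, deliverable)
        if feedback.satisfied:
            break
        transcript += f"\n[USER FEEDBACK] {feedback.text}"

    # 2. Grade the final deliverable inside an isolated Docker sandbox using
    #    the task's functional and visual rubrics (grade.py is hypothetical).
    container = client.containers.run(
        sandbox_image,
        command=["python", "grade.py", task.task_id],
        volumes={deliverable.path: {"bind": "/workspace", "mode": "ro"}},
        detach=True,
    )
    container.wait()
    score = float(container.logs().decode().strip())
    container.remove()
    return score
```

Isolating grading in a container keeps functional checks (e.g., running generated code or inspecting produced files) reproducible and safe, which is what removes the human-in-the-loop scalability bottleneck the abstract highlights.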