AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
January 16, 2026
Authors: Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu
cs.AI
Abstract
Autonomous agents built on Large Language Models (LLMs) demonstrate multifaceted capabilities with the potential to contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability and fail to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user-simulation agent to provide iterative feedback and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments show that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals substantial disparities across models in resource efficiency, feedback-driven self-correction, and tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models perform best within their native ecosystems (e.g., Claude-4.5-Opus via the Claude-Agent-SDK), while open-source models exhibit distinct performance peaks under different execution frameworks, suggesting framework-specific optimization opportunities. AgencyBench thus serves as a critical testbed for next-generation agents and highlights the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
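To make the automated evaluation flow described above more concrete, the sketch below shows one plausible way to combine a user-simulation feedback loop with Docker-sandboxed, rubric-based scoring. It is a minimal illustration under assumed interfaces: `Task`, `Rubric`, `run_agent`, `simulate_user_feedback`, and the `python:3.11-slim` image are hypothetical names chosen for the sketch, not the released toolkit's actual API (see the repository linked above for the real implementation).

```python
"""Minimal sketch (not the released toolkit) of AgencyBench-style evaluation:
a simulated user supplies iterative feedback, and deliverables are scored
against per-task rubrics inside a Docker sandbox. All names are illustrative."""
from dataclasses import dataclass, field
from typing import Callable, List
import subprocess


@dataclass
class Rubric:
    description: str             # e.g. "index.html renders a sortable table"
    check_cmd: List[str]         # command executed inside the sandbox
    weight: float = 1.0


@dataclass
class Task:
    query: str                   # initial user request
    deliverable_dir: str         # host path the agent writes outputs to
    rubrics: List[Rubric] = field(default_factory=list)
    max_feedback_rounds: int = 3  # iterative-feedback budget


def run_in_sandbox(task: Task, rubric: Rubric, image: str = "python:3.11-slim") -> bool:
    """Run one functional rubric check in an isolated Docker container."""
    result = subprocess.run(
        ["docker", "run", "--rm",
         "-v", f"{task.deliverable_dir}:/workspace:ro",  # mount deliverables read-only
         "-w", "/workspace", image, *rubric.check_cmd],
        capture_output=True, text=True, timeout=300,
    )
    return result.returncode == 0


def evaluate(task: Task,
             run_agent: Callable[[str], None],
             simulate_user_feedback: Callable[[Task], str]) -> float:
    """Feedback loop: the agent attempts the task, a user-simulation agent
    critiques the deliverables, and the agent retries until the budget runs out."""
    prompt = task.query
    for _ in range(task.max_feedback_rounds):
        run_agent(prompt)                        # agent produces/updates deliverables
        feedback = simulate_user_feedback(task)  # LLM-simulated user review
        if not feedback:                         # empty feedback == user satisfied (assumption)
            break
        prompt = f"{task.query}\n\nUser feedback:\n{feedback}"

    # Weighted rubric score over checks executed in the sandbox.
    total = sum(r.weight for r in task.rubrics) or 1.0
    passed = sum(r.weight for r in task.rubrics if run_in_sandbox(task, r))
    return passed / total
```

Mounting the deliverable directory read-only keeps scoring side-effect free, and per-rubric weights let visual and functional checks contribute unevenly to the final task score.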