AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Abstract

Autonomous agents based on Large Language Models (LLMs) demonstrate multifaceted capabilities that could contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability and fail to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds and observe that proprietary models perform best within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
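The abstract's automated evaluation idea, iterative feedback followed by rubric-based scoring, can be illustrated with a minimal sketch. This is not the AgencyBench implementation: the actual toolkit uses an LLM-based user simulation agent and a Docker sandbox, neither of which is reproduced here. Every name below (`RubricItem`, `score`, `feedback`, `run_with_feedback`, `toy_agent`) is hypothetical, and the rubric checks are simple string tests standing in for visual and functional assessment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One pass/fail check on a deliverable (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]


def score(deliverable: str, rubric: List[RubricItem]) -> float:
    """Return the fraction of rubric items the deliverable satisfies."""
    if not rubric:
        return 0.0
    passed = sum(1 for item in rubric if item.check(deliverable))
    return passed / len(rubric)


def feedback(deliverable: str, rubric: List[RubricItem]) -> List[str]:
    """Stand-in for a user simulation agent: list the unmet rubric items."""
    return [item.description for item in rubric if not item.check(deliverable)]


def run_with_feedback(
    agent: Callable[[str, List[str]], str],
    query: str,
    rubric: List[RubricItem],
    max_rounds: int = 3,
) -> float:
    """Let the agent revise its deliverable until the rubric passes or rounds run out."""
    notes: List[str] = []
    deliverable = ""
    for _ in range(max_rounds):
        deliverable = agent(query, notes)
        notes = feedback(deliverable, rubric)
        if not notes:  # every rubric item is satisfied
            break
    return score(deliverable, rubric)


def toy_agent(query: str, notes: List[str]) -> str:
    """Toy agent: starts with a bare page and adds a button only when asked."""
    page = "<html><title>Demo</title>"
    if any("button" in note for note in notes):
        page += "<button>Go</button>"
    return page + "</html>"


rubric = [
    RubricItem("page has a title", lambda d: "<title>" in d),
    RubricItem("page has a button", lambda d: "<button>" in d),
]

final_score = run_with_feedback(toy_agent, "build a demo page", rubric)
single_round_score = run_with_feedback(toy_agent, "build a demo page", rubric, max_rounds=1)
```

In this toy setup the agent misses one rubric item on its first attempt and fixes it after one round of feedback, so the gap between `single_round_score` and `final_score` is a small-scale analogue of the feedback-driven self-correction the abstract mentions.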

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
