Benchmark Test-Time Scaling of General LLM Agents

Abstract

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
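To make the parallel-scaling limitation concrete, the toy sketch below contrasts two ways of scoring a set of sampled trajectories: an oracle view, where a task counts as solved if any sample succeeded, and a deployed view, where only the sample ranked highest by a verifier is returned. This is one plausible reading of the "verification gap" named in the abstract, not the benchmark's actual evaluation code; the function names, the outcomes, and the verifier scores are all hypothetical.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[bool]) -> bool:
    """Oracle view: solved if any of the k sampled trajectories succeeded."""
    return any(outcomes)


def select_with_verifier(outcomes: Sequence[bool], scores: Sequence[float]) -> bool:
    """Deployed view: only the trajectory the verifier scores highest is returned."""
    best = max(range(len(outcomes)), key=lambda i: scores[i])
    return outcomes[best]


# Hypothetical toy data: 3 tasks, 4 sampled trajectories each.
# "outcomes" marks whether each trajectory solved the task;
# "scores" are made-up verifier confidences for the same trajectories.
tasks = [
    {"outcomes": [False, True, False, False], "scores": [0.9, 0.4, 0.2, 0.1]},
    {"outcomes": [True, False, False, True], "scores": [0.8, 0.3, 0.1, 0.5]},
    {"outcomes": [False, False, False, False], "scores": [0.6, 0.5, 0.4, 0.3]},
]

oracle = sum(pass_at_k(t["outcomes"]) for t in tasks) / len(tasks)
selected = sum(
    select_with_verifier(t["outcomes"], t["scores"]) for t in tasks
) / len(tasks)
gap = oracle - selected  # accuracy lost because the verifier picks wrong samples

print(f"oracle pass@4: {oracle:.2f}, verifier-selected: {selected:.2f}, gap: {gap:.2f}")
```

In this toy data the first task has a correct trajectory that the verifier ranks below an incorrect one, so sampling more trajectories raises the oracle score without raising the accuracy of the answer actually returned.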
