Towards a Science of Scaling Agent Systems

Abstract

Agents, language model (LM)-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems. Our evaluation covers four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. Using empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, we derive a predictive model that achieves a cross-validated R^2=0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; (2) capability saturation: coordination yields diminishing or negative returns (beta=-0.408, p<0.001) once single-agent baselines exceed ~45%; (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks like financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, all multi-agent variants degraded performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.
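As a rough illustration of how the qualitative findings in the abstract might be turned into a design rule of thumb, the Python sketch below maps a task description to a suggested architecture. It is not the paper's predictive model, which is fitted on coordination metrics. The `TaskProfile` fields, the structure labels, and the `suggest_architecture` function are hypothetical names introduced here; only the ~45% saturation threshold and the per-task-type tendencies come from the abstract. The tool-coordination trade-off is not modeled.

```python
from dataclasses import dataclass

# Threshold taken from the abstract: coordination gains fade once the
# single-agent baseline exceeds roughly 45%.
SATURATION_THRESHOLD = 0.45


@dataclass
class TaskProfile:
    """Hypothetical task description; both fields are illustrative assumptions."""
    single_agent_accuracy: float  # baseline success rate in [0, 1]
    structure: str  # "sequential", "parallelizable", or "dynamic_navigation"


def suggest_architecture(task: TaskProfile) -> str:
    """Return a coarse architecture suggestion (a heuristic, not the paper's model)."""
    # Sequential reasoning: every multi-agent variant degraded performance.
    if task.structure == "sequential":
        return "single"
    # Capability saturation: a strong single-agent baseline leaves little to gain.
    if task.single_agent_accuracy > SATURATION_THRESHOLD:
        return "single"
    # Parallelizable tasks (e.g. financial reasoning) favored centralized coordination.
    if task.structure == "parallelizable":
        return "centralized"
    # Dynamic web navigation favored decentralized coordination.
    if task.structure == "dynamic_navigation":
        return "decentralized"
    # Unknown structure: fall back to the simplest option.
    return "single"


if __name__ == "__main__":
    print(suggest_architecture(TaskProfile(0.30, "parallelizable")))
```

The ordering of the checks is a design choice of this sketch: the two conditions under which the abstract reports diminishing or negative returns are tested first, so a multi-agent topology is suggested only when neither applies.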
