Towards a Science of Scaling Agent Systems
December 9, 2025
Authors: Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu
cs.AI
Abstract
Agents, language model (LM)-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems, evaluated across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. From empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, we derive a predictive model that achieves cross-validated R^2 = 0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; (2) capability saturation: coordination yields diminishing or negative returns (β = -0.408, p < 0.001) once single-agent baselines exceed ~45%; and (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks such as financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). For sequential reasoning tasks, however, all multi-agent variants degrade performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.
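The abstract's three effects imply a simple decision rule for choosing a coordination topology. The sketch below is a toy distillation of those reported trends, not the paper's actual predictive model (which regresses over empirical coordination metrics); the helper name `recommend_architecture`, its inputs, and the branch ordering are illustrative assumptions, while the ~45% saturation threshold and the architecture preferences come from the numbers quoted above.

```python
def recommend_architecture(single_agent_acc: float,
                           parallelizable: bool,
                           dynamic_environment: bool) -> str:
    """Toy decision rule distilled from the abstract's reported effects.

    - Capability saturation: beyond ~45% single-agent accuracy,
      coordination yields diminishing or negative returns.
    - Sequential (non-parallelizable) tasks: all multi-agent variants
      degraded performance, so stay single-agent.
    - Otherwise, dynamic environments (e.g. web navigation) favored
      decentralized coordination, while parallelizable tasks such as
      financial reasoning favored centralized coordination, which also
      bounds error amplification (4.4x vs. 17.2x for independent agents).
    """
    if single_agent_acc > 0.45 or not parallelizable:
        return "single"
    if dynamic_environment:
        return "decentralized"
    return "centralized"


# Illustrative calls mirroring the benchmark regimes described above:
print(recommend_architecture(0.20, parallelizable=True, dynamic_environment=False))
print(recommend_architecture(0.20, parallelizable=True, dynamic_environment=True))
print(recommend_architecture(0.60, parallelizable=True, dynamic_environment=False))
print(recommend_architecture(0.20, parallelizable=False, dynamic_environment=False))
```

The real framework replaces these hard thresholds with a fitted model over measured efficiency, overhead, error amplification, and redundancy; this sketch only shows how the three effects compose into a recommendation.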