Towards a Science of Scaling Agent Systems

December 9, 2025
Authors: Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, Xin Liu
cs.AI

Abstract

Agents, language model (LM)-based systems capable of reasoning, planning, and acting, are becoming the dominant paradigm for real-world AI applications. Despite this widespread adoption, the principles that determine their performance remain underexplored, leaving practitioners to rely on heuristics rather than principled design choices. We address this gap by deriving quantitative scaling principles for agent systems, evaluated across four diverse benchmarks: Finance-Agent, BrowseComp-Plus, PlanCraft, and Workbench. Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we perform a controlled evaluation spanning 180 configurations with standardized tools and token budgets. We derive a predictive model from empirical coordination metrics, including efficiency, overhead, error amplification, and redundancy, that achieves cross-validated R^2 = 0.513. We identify three dominant effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead; (2) capability saturation: coordination yields diminishing or negative returns (β = -0.408, p < 0.001) once single-agent baselines exceed ~45%; and (3) topology-dependent error amplification: independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x. Centralized coordination improves performance by 80.9% on parallelizable tasks such as financial reasoning, while decentralized coordination excels on dynamic web navigation (+9.2% vs. +0.2%). Yet for sequential reasoning tasks, all multi-agent variants degrade performance by 39-70%. The framework predicts the optimal coordination strategy for 87% of held-out configurations, providing a predictive principle of agentic scaling based on measurable task properties.
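The abstract summarizes the predictive framework without implementation detail. As a hedged illustration only, the sketch below shows one plausible reading: a cross-validated ridge regression (an assumption; the abstract does not name the model class) mapping measurable coordination metrics plus an architecture indicator to the performance delta over a single agent, then selecting the architecture with the highest predicted delta. The feature definitions, the synthetic data, and the helper names (`features`, `best_architecture`) are all hypothetical, not the authors' pipeline.

```python
# Illustrative sketch (assumptions, not the authors' code): fit a ridge
# regression from task-level coordination metrics plus a one-hot
# architecture indicator to the performance delta over a single agent,
# then choose the architecture with the highest predicted delta.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
ARCHS = ["independent", "centralized", "decentralized", "hybrid"]

def features(baseline, tool_calls, overhead, error_amp, redundancy, arch):
    """Numeric task metrics + one-hot architecture indicator (all assumed)."""
    one_hot = [1.0 if a == arch else 0.0 for a in ARCHS]
    return [baseline, tool_calls, overhead, error_amp, redundancy] + one_hot

# Synthetic training set standing in for the paper's 180 configurations.
rows, deltas = [], []
for _ in range(180):
    baseline = rng.uniform(0.1, 0.9)          # single-agent accuracy
    arch = ARCHS[rng.integers(len(ARCHS))]
    x = features(baseline, rng.uniform(0, 30), rng.uniform(0, 1),
                 rng.uniform(1, 20), rng.uniform(0, 1), arch)
    # Toy signal echoing the reported effects: gains shrink as the
    # single-agent baseline rises (capability saturation, slope -0.408)
    # and as error amplification grows; centralized coordination gets a
    # small bonus since it contains error propagation best.
    delta = 0.5 - 0.408 * baseline - 0.02 * x[3]
    delta += 0.15 if arch == "centralized" else 0.0
    rows.append(x)
    deltas.append(delta + rng.normal(0, 0.1))

X, y = np.array(rows), np.array(deltas)
model = Ridge(alpha=1.0)
print("cross-validated R^2:",
      cross_val_score(model, X, y, cv=5, scoring="r2").mean())
model.fit(X, y)

def best_architecture(baseline, tool_calls, overhead, error_amp, redundancy):
    """Pick the architecture with the highest predicted performance delta."""
    preds = {a: model.predict([features(baseline, tool_calls, overhead,
                                        error_amp, redundancy, a)])[0]
             for a in ARCHS}
    return max(preds, key=preds.get)

print(best_architecture(0.30, 25.0, 0.4, 5.0, 0.2))  # a tool-heavy task
```

On real data, the per-configuration metrics would come from the controlled evaluation itself; the toy signal here merely bakes in the reported saturation slope so the sketch runs self-contained.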