From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

May 25, 2026
Author: Shangding Gu
cs.AI

Abstract

This paper argues that the next major bottleneck in agentic AI is system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric. It often reduces agents to final-task success and treats memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks (context governance, trustworthy memory, and dynamic skill routing), together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws (https://github.com/SafeRL-Lab/cheetahclaws), a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.
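As a rough illustration of how the components named in the abstract could fit together, the following is a minimal Python sketch. It is not taken from CheetahClaws and does not reflect its API: the `Harness` class, its fields, the `toy_model` function, and the fixed context budget are all hypothetical names and simplifications introduced here. The sketch only shows one possible shape for a context constructor, a skill-routing table, a verification gate in front of memory writes, and an auditable trace.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Harness:
    """Toy agent harness. Every name here is hypothetical, not CheetahClaws' API."""

    model: Callable[[str], str]               # stand-in for a foundation model
    skills: Dict[str, Callable[[str], str]]   # skill-routing table
    verify: Callable[[str], bool]             # verification-and-governance check
    memory: List[str] = field(default_factory=list)  # memory substrate
    trace: List[dict] = field(default_factory=list)  # auditable trajectory log
    context_budget: int = 3                   # max memory items placed in context

    def build_context(self, task: str) -> str:
        # Context constructor: include only the most recent memory items,
        # followed by the current task.
        recent = self.memory[-self.context_budget:]
        return "\n".join(recent + [task])

    def step(self, task: str) -> str:
        # One iteration of the orchestration loop.
        context = self.build_context(task)
        skill_name = self.model(context)      # the model proposes a skill by name
        skill = self.skills.get(skill_name)
        if skill is None:
            result, ok = f"unknown skill: {skill_name}", False
        else:
            result = skill(task)
            ok = self.verify(result)
        if ok:
            # Only verified results are written to memory.
            self.memory.append(result)
        # Every step is logged, verified or not, so the trajectory can be audited.
        self.trace.append(
            {"task": task, "skill": skill_name, "result": result, "verified": ok}
        )
        return result


def toy_model(context: str) -> str:
    # Hypothetical "model": routes on the last line of the context (the task).
    last_line = context.splitlines()[-1]
    return "upper" if last_line.startswith("shout") else "echo"
```

In this sketch, a harness-level metric such as memory hygiene or verification cost would be computed from `trace` and `memory` rather than from the final answer alone, which is the kind of evaluation shift the abstract calls for.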