From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

Abstract

This paper identifies system scaling, not only model scaling, as the next major bottleneck in agentic AI: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a foundation model as a first-class object of design, evaluation, and optimization. Although recent large language models enable agents to use tools, retrieve information, maintain memory, and execute long-horizon workflows, evaluation remains largely model-centric, often reducing agents to final-task success while treating memory, retrieval, tool use, orchestration, verification, and governance as secondary implementation details. This framing is increasingly inadequate because agent performance emerges from the interaction among the foundation model, memory substrate, context constructor, skill-routing layer, orchestration loop, and verification-and-governance layer. Together, these components form the agent harness, which translates model capability into long-horizon agent behavior. We study scaling the harness through three core bottlenecks (context governance, trustworthy memory, and dynamic skill routing) together with the orchestration and governance mechanisms that coordinate and constrain them. We further outline a research agenda for harness-level benchmarks that go beyond one-shot task success to measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. To make the discussion concrete, we develop CheetahClaws (https://github.com/SafeRL-Lab/cheetahclaws), a Python-native reference harness, and compare it with Claude Code and OpenClaw. Our main claim is that future progress in agentic AI will depend as much on system design as on stronger foundation models.
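To illustrate how the harness components named above might fit together, the following is a minimal Python sketch, not the CheetahClaws implementation. Every name in it (`Harness`, `build_context`, `route`, `verify`, `toy_model`, the `skill:argument` action format, and the `final:` prefix) is a hypothetical choice made for this example. The foundation model is replaced by a toy function so the block runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names here are hypothetical and are not taken from CheetahClaws.
Model = Callable[[str], str]   # foundation model stand-in: prompt -> text
Skill = Callable[[str], str]   # a tool or skill: argument -> result


@dataclass
class Harness:
    model: Model
    skills: Dict[str, Skill]
    memory: List[str] = field(default_factory=list)  # memory substrate
    trace: List[str] = field(default_factory=list)   # audit log of the trajectory
    context_budget: int = 3                          # max memory items per prompt

    def build_context(self, task: str) -> str:
        # Context constructor: include only the most recent memory items.
        recent = self.memory[-self.context_budget:]
        return "\n".join(recent + [f"TASK: {task}"])

    def route(self, action: str) -> str:
        # Skill routing: actions are assumed to look like "skill_name:argument".
        name, _, arg = action.partition(":")
        if name not in self.skills:
            return f"error: unknown skill {name!r}"
        return self.skills[name](arg)

    def verify(self, result: str) -> bool:
        # Verification and governance: reject results flagged as errors.
        return not result.startswith("error")

    def run(self, task: str, max_steps: int = 5) -> str:
        # Orchestration loop: the model proposes an action, the harness
        # routes it, verifies the result, and records what happened.
        for step in range(max_steps):
            action = self.model(self.build_context(task))
            self.trace.append(f"step {step}: {action}")
            if action.startswith("final:"):
                return action[len("final:"):]
            result = self.route(action)
            if self.verify(result):
                self.memory.append(f"{action} -> {result}")
            else:
                self.trace.append(f"step {step}: rejected ({result})")
        return "no answer within step budget"


def toy_model(prompt: str) -> str:
    # Toy stand-in for a foundation model: answer once a result is in context.
    for line in prompt.splitlines():
        if line.startswith("add:") and "->" in line:
            return "final:" + line.split("->")[1].strip()
    return "add:2,3"


harness = Harness(
    model=toy_model,
    skills={"add": lambda arg: str(sum(int(x) for x in arg.split(",")))},
)
answer = harness.run("What is 2 + 3?")
```

In this sketch the final answer is only one output of a run. The `trace` and `memory` fields record the trajectory and what was stored, which is the kind of harness-level state the benchmarks proposed above would inspect.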