When Cloud Agents Meet On-Device Agents: Lessons from Hybrid Multi-Agent Systems

Abstract

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud, which perform strongly across a wide range of tasks but at substantial cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) that combine on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled. In the absence of general design principles, hybrid components are not yet the prevailing choice, and where they are used they are typically introduced through ad hoc, domain-specific decisions. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of energy, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can benefit effectively from LLM assistance, the optimal architecture is highly task-dependent, and more frontier-level compute does not consistently translate into better performance.