When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

Abstract

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud, which perform strongly across a wide range of tasks but at substantial cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) that combine on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled. In the absence of general design principles, hybrid components, which are not the most prevalent choice, are typically introduced through ad hoc, domain-specific decisions. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.