Cloud Agents Meet Device Agents: Lessons Learned from Hybrid Multi-Agent Systems

Abstract

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), which are typically hosted in the cloud and perform strongly across a wide range of tasks at substantial cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) that combine on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled. In the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: SLMs can benefit effectively from LLM assistance, but the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate into better performance.
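
To make the notion of an operating point on a Pareto frontier concrete, the following is a minimal sketch and not the authors' evaluation code. The configuration names and all numbers are made up purely for illustration, and the three fields (accuracy, monetary cost, edge energy) are assumed stand-ins for the performance, cost, and power axes named in the abstract. A configuration is on the frontier if no other configuration is at least as good on every axis and strictly better on one.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    name: str        # hypothetical configuration label
    accuracy: float  # task accuracy, higher is better
    cost_usd: float  # monetary cost per task, lower is better
    energy_j: float  # edge energy per task, lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.cost_usd <= b.cost_usd
        and a.energy_j <= b.energy_j
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.cost_usd < b.cost_usd
        or a.energy_j < b.energy_j
    )
    return no_worse and strictly_better


def pareto_frontier(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Illustrative, made-up operating points (not results from the paper).
points = [
    OperatingPoint("slm_only", 0.55, 0.000, 40.0),
    OperatingPoint("llm_only", 0.80, 0.050, 5.0),
    OperatingPoint("hybrid_a", 0.75, 0.020, 25.0),
    OperatingPoint("hybrid_b", 0.70, 0.030, 30.0),
]

frontier_names = [p.name for p in pareto_frontier(points)]
print(frontier_names)
```

With these invented numbers, `hybrid_b` drops out because `hybrid_a` is better on all three axes, while the other three remain as distinct trade-offs. A design choice "shifts the operating point" in this sense when it moves a configuration to a different position relative to that frontier.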