API Agents vs. GUI Agents: Divergence and Convergence

Abstract

Large language models (LLMs) have evolved beyond simple text generation to power software agents that translate natural language commands directly into tangible actions. API-based LLM agents rose to prominence first, owing to their robust automation capabilities and seamless integration with programmatic endpoints. More recently, progress in multimodal LLM research has enabled GUI-based LLM agents that interact with graphical user interfaces in a human-like manner. Although the two paradigms share the goal of LLM-driven task automation, they diverge significantly in architectural complexity, development workflows, and user interaction models. This paper presents the first comprehensive comparative study of API-based and GUI-based LLM agents, systematically analyzing where they diverge and where they may converge. We examine key dimensions and highlight scenarios in which hybrid approaches can harness their complementary strengths. By proposing clear decision criteria and illustrating practical use cases, we aim to guide practitioners and researchers in selecting, combining, or transitioning between these paradigms. Finally, we suggest that continuing innovation in LLM-based automation is poised to blur the line between API-driven and GUI-driven agents, paving the way for more flexible, adaptive solutions across a wide range of real-world applications.
