PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

Abstract

Among GUI agents built on multimodal large language models (MLLMs), the PC scenario is harder than the smartphone scenario: the interactive environment is more complex, and the workflows within and across applications are more intricate. To address these issues, we propose a hierarchical agent framework named PC-Agent. From the perception perspective, we devise an Active Perception Module (APM) to compensate for the limited ability of current MLLMs to perceive screenshot content. From the decision-making perspective, to handle complex user instructions and interdependent subtasks more effectively, we propose a hierarchical multi-agent collaboration architecture that decomposes decision-making into three levels: Instruction, Subtask, and Action. Within this architecture, three agents (Manager, Progress, and Decision) handle instruction decomposition, progress tracking, and step-by-step decision-making, respectively. In addition, a Reflection agent provides timely bottom-up error feedback and adjustment. We also introduce PC-Eval, a new benchmark with 25 real-world complex instructions. Empirical results on PC-Eval show that PC-Agent achieves a 32% absolute improvement in task success rate over previous state-of-the-art methods. The code will be publicly available.
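The Instruction-Subtask-Action hierarchy described in the abstract can be illustrated with a toy control loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: every function name here (`manager_decompose`, `decision_next_action`, `reflect`, `run`) is hypothetical, the MLLM calls are replaced by simple string rules, the Active Perception Module is omitted, and each subtask is assumed to finish after one successful action.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Progress:
    """Stand-in for the Progress agent: tracks what has been done so far."""
    completed_subtasks: List[str] = field(default_factory=list)
    action_log: List[str] = field(default_factory=list)


def manager_decompose(instruction: str) -> List[str]:
    """Stand-in for the Manager agent (instruction level -> subtask level).

    Assumption: subtasks are separated by the literal text ' then '.
    A real Manager would query an MLLM instead.
    """
    return [part.strip() for part in instruction.split(" then ") if part.strip()]


def decision_next_action(subtask: str, feedback: Optional[str]) -> str:
    """Stand-in for the Decision agent (subtask level -> action level).

    Assumption: the first attempt is 'do: ...'; after error feedback
    the agent adjusts and emits 'retry: ...'.
    """
    prefix = "do" if feedback is None else "retry"
    return f"{prefix}: {subtask}"


def reflect(action: str, succeeded: bool) -> Optional[str]:
    """Stand-in for the Reflection agent: returns error feedback or None."""
    return None if succeeded else f"'{action}' had no visible effect"


def run(
    instruction: str,
    execute: Callable[[str], bool],
    max_steps_per_subtask: int = 3,
) -> Progress:
    """Run the toy hierarchy; `execute` is a hypothetical environment hook."""
    progress = Progress()
    for subtask in manager_decompose(instruction):
        feedback: Optional[str] = None
        for _ in range(max_steps_per_subtask):
            action = decision_next_action(subtask, feedback)
            succeeded = execute(action)
            progress.action_log.append(action)
            # Bottom-up feedback: the reflection result shapes the next decision.
            feedback = reflect(action, succeeded)
            if feedback is None:
                progress.completed_subtasks.append(subtask)
                break
        else:
            # Step budget exhausted: stop, since later subtasks may depend on this one.
            break
    return progress
```

Calling `run("open Chrome then search weather", execute)` with an `execute` callback that rejects the first search attempt logs one failed action, then a retry driven by the reflection feedback, and finally marks both subtasks complete.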
