GPA: Learning GUI Process Automation from Demonstrations
April 2, 2026
Authors: Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese, Junnan Li
cs.AI
Abstract
GUI Process Automation (GPA) is a lightweight yet general vision-based Robotic Process Automation (RPA) method that enables fast, stable process replay from a single demonstration. To address the fragility of traditional RPA and the risks of non-determinism in current vision-language-model-based GUI agents, GPA offers three core benefits: (1) robustness, via Sequential Monte Carlo-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. GPA can also be used as an MCP/CLI tool by other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles the GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved a higher success rate and ran 10 times faster on long-horizon GUI tasks.
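To make the idea of Sequential Monte Carlo-based localization more concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the state space, likelihood, or resampling scheme, so everything here is an assumption: the state is a global transform (scale, x-offset, y-offset) from demonstration coordinates to current screen coordinates, detections of anchor elements carry Gaussian pixel noise, and the noise level is annealed from loose to tight so that particles do not collapse early. The names `smc_localize` and `map_point` are hypothetical.

```python
import math
import random


def smc_localize(demo_pts, observed_pts, n_particles=2000, n_iters=20,
                 sigma_start=100.0, sigma_min=3.0, decay=0.7, seed=0):
    """Estimate (scale, tx, ty) with observed ~= scale * demo + (tx, ty).

    Illustrative particle scheme only; all ranges and schedules are assumed.
    """
    rng = random.Random(seed)
    # Broad uniform prior over scale and pixel offsets (assumed ranges).
    particles = [(rng.uniform(0.5, 2.0),
                  rng.uniform(-300.0, 300.0),
                  rng.uniform(-300.0, 300.0)) for _ in range(n_particles)]
    estimate = None
    for step in range(n_iters):
        # Annealed noise level: tolerant at first, tightening to sigma_min.
        sigma = max(sigma_min, sigma_start * decay ** step)
        # Gaussian log-likelihood of all anchor detections for each particle.
        log_w = []
        for s, tx, ty in particles:
            sq_err = 0.0
            for (dx, dy), (ox, oy) in zip(demo_pts, observed_pts):
                sq_err += (s * dx + tx - ox) ** 2 + (s * dy + ty - oy) ** 2
            log_w.append(-sq_err / (2.0 * sigma ** 2))
        # Subtract the maximum before exponentiating to avoid underflow.
        top = max(log_w)
        weights = [math.exp(lw - top) for lw in log_w]
        total = sum(weights)
        # Point estimate: weighted mean of the current particle set.
        estimate = tuple(
            sum(w * p[i] for w, p in zip(weights, particles)) / total
            for i in range(3)
        )
        # Multinomial resampling in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
        # Jitter the resampled particles to keep the set diverse.
        particles = [(s + rng.gauss(0.0, sigma / 1000.0),
                      tx + rng.gauss(0.0, 0.3 * sigma),
                      ty + rng.gauss(0.0, 0.3 * sigma))
                     for s, tx, ty in particles]
    return estimate


def map_point(estimate, demo_point):
    """Map a demonstration coordinate to the current screen."""
    s, tx, ty = estimate
    return (s * demo_point[0] + tx, s * demo_point[1] + ty)


# Anchor positions recorded in the demo, and noisy detections on a
# rescaled, shifted screen (made-up numbers for illustration).
demo = [(100, 100), (400, 120), (250, 300), (600, 450)]
observed = [(156.5, 104.0), (528.0, 130.5), (343.0, 357.0), (781.0, 541.0)]
transform = smc_localize(demo, observed)
click_x, click_y = map_point(transform, (500, 200))
```

In this sketch, a click recorded at (500, 200) during the demonstration is replayed at the mapped location on the current screen. Because the estimate pools all anchor detections, noise on any single detection has a limited effect on the result. A practical system would likely also need to handle missed or spurious detections, which this example does not attempt.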