GPA: Learning GUI Process Automation from Demonstrations

Abstract

GUI Process Automation (GPA) is a lightweight yet general vision-based Robotic Process Automation (RPA) technique that enables fast, stable process replay from a single demonstration. To address the fragility of traditional RPA and the non-deterministic risks of current GUI agents based on vision-language models, GPA offers three core benefits: (1) robustness, via Sequential Monte Carlo-based localization that handles rescaling and detection uncertainty; (2) determinism and reliability, safeguarded by readiness calibration; and (3) privacy, through fast, fully local execution. Together these deliver the adaptability, robustness, and security required for enterprise workflows. GPA can also be exposed as an MCP/CLI tool to other agents with coding capabilities, so that the agent only reasons and orchestrates while GPA handles GUI execution. In a pilot experiment comparing GPA with Gemini 3 Pro (with CUA tools) on long-horizon GUI tasks, GPA achieved a higher success rate and executed 10 times faster.