ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands
December 31, 2025
Authors: Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou
cs.AI
Abstract
Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click-coordinate predictions (x, y), which precludes the free-form, closed-loop trajectories (e.g., dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-π, the first flow-based generative model to serve as a GUI dexterous hand, featuring three designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; and (iii) Drag Training Data and Benchmark: we manually collect and synthesize 20K drag trajectories across five domains (e.g., PowerPoint, Adobe Premiere Pro) and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g., Operator scores 13.27, and the best, Gemini-2.5-CUA, reaches only 22.18), whereas ShowUI-π achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
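The abstract does not spell out the flow-based action generation, so the sketch below illustrates one plausible reading: a lightweight action expert trained with a rectified-flow (flow-matching) objective to predict incremental cursor deltas conditioned on a visual observation embedding. All names here (`ActionExpert`, `obs_dim`, `sample_delta`) are illustrative assumptions, not the paper's actual interfaces.

```python
# Minimal sketch (assumptions, not the paper's implementation):
# a small MLP velocity field over 2D cursor deltas, trained with
# a rectified-flow objective and sampled by Euler integration.
import torch
import torch.nn as nn

class ActionExpert(nn.Module):
    """Velocity field over cursor deltas, conditioned on a visual
    observation embedding and the flow time t in [0, 1]."""
    def __init__(self, obs_dim=512, act_dim=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs_emb, a_t, t):
        return self.net(torch.cat([obs_emb, a_t, t], dim=-1))

def flow_matching_loss(model, obs_emb, action):
    """Regress the constant velocity of the straight path from
    Gaussian noise to the ground-truth cursor delta."""
    noise = torch.randn_like(action)
    t = torch.rand(action.size(0), 1)
    a_t = (1 - t) * noise + t * action   # point on the noise->action path
    target_v = action - noise            # constant velocity of that path
    return ((model(obs_emb, a_t, t) - target_v) ** 2).mean()

@torch.no_grad()
def sample_delta(model, obs_emb, steps=10):
    """Integrate the learned velocity field from noise to a (dx, dy)."""
    a = torch.randn(obs_emb.size(0), 2)
    for i in range(steps):
        t = torch.full((obs_emb.size(0), 1), i / steps)
        a = a + model(obs_emb, a, t) / steps  # Euler step of size 1/steps
    return a
```

In a closed-loop drag, `sample_delta` would be called once per frame on the fresh observation embedding, so each predicted delta reflects the current screen state rather than a precomputed trajectory, which is what distinguishes this setup from one-shot (x, y) click prediction.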