VideoGUI: A Benchmark for GUI Automation from Instructional Videos
June 14, 2024
Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
cs.AI
Abstract
Graphical User Interface (GUI) automation holds significant promise for
enhancing human productivity by assisting with computer tasks. Existing task
formulations primarily focus on simple tasks that can be specified by a single,
language-only instruction, such as "Insert a new slide." In this work, we
introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI
assistants on visual-centric GUI tasks. Sourced from high-quality web
instructional videos, our benchmark focuses on tasks involving professional and
novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex
activities (e.g., video editing). VideoGUI evaluates GUI assistants through a
hierarchical process, allowing for identification of the specific levels at
which they may fail: (i) high-level planning: reconstruct procedural subtasks
from visual conditions without language descriptions; (ii) middle-level
planning: generate sequences of precise action narrations based on visual state
(i.e., screenshot) and goals; (iii) atomic action execution: perform specific
actions such as accurately clicking designated elements. For each level, we
design evaluation metrics across individual dimensions to provide clear
signals, such as individual performance in clicking, dragging, typing, and
scrolling for atomic action execution. Our evaluation on VideoGUI reveals that
even the SoTA large multimodal model GPT-4o performs poorly on visual-centric
GUI tasks, especially for high-level planning.
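To make the hierarchical evaluation concrete, the sketch below illustrates how the two extremes of the hierarchy, high-level planning and atomic action execution, might be scored. It is a minimal illustration only: the names (`AtomicAction`, `score_high_level`, `score_atomic`) and the 10-pixel tolerance `tol` are assumptions for exposition, not the benchmark's actual implementation or API.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

# Hypothetical schema for atomic actions; field names are illustrative
# and not taken from the VideoGUI codebase.
AtomicType = Literal["click", "drag", "type", "scroll"]

@dataclass
class AtomicAction:
    kind: AtomicType
    target: Optional[Tuple[int, int]] = None  # e.g., click/drag endpoint in pixels
    text: Optional[str] = None                # e.g., the string for a "type" action

def score_high_level(pred_subtasks, gold_subtasks):
    """High-level planning: fraction of gold subtasks recovered
    (order-insensitive substring match -- a stand-in for the paper's metric)."""
    hits = sum(any(g.lower() in p.lower() for p in pred_subtasks)
               for g in gold_subtasks)
    return hits / max(len(gold_subtasks), 1)

def score_atomic(pred: AtomicAction, gold: AtomicAction, tol: int = 10) -> bool:
    """Atomic execution: the action type must match, and pointer actions must
    land within `tol` pixels of the ground truth (an assumed threshold)."""
    if pred.kind != gold.kind:
        return False
    if gold.target is not None:
        if pred.target is None:
            return False
        dx = pred.target[0] - gold.target[0]
        dy = pred.target[1] - gold.target[1]
        return dx * dx + dy * dy <= tol * tol
    return pred.text == gold.text
```

For example, `score_atomic(AtomicAction("click", target=(120, 45)), AtomicAction("click", target=(123, 47)))` returns `True` under the default tolerance, yielding the kind of per-action-type signal (click vs. drag vs. type vs. scroll) that the abstract describes for atomic action execution.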