VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Abstract

Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, which makes it possible to identify the specific level at which they fail: (i) high-level planning: reconstructing procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generating sequences of precise action narrations based on the visual state (i.e., a screenshot) and goals; (iii) atomic action execution: performing specific actions, such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals; for atomic action execution, for example, performance is reported separately for clicking, dragging, typing, and scrolling. Our evaluation on VideoGUI reveals that even the state-of-the-art large multimodal model GPT-4o performs poorly on visual-centric GUI tasks, especially in high-level planning.
