CutVerse: A Compositional GUI Agent Benchmark for Media Post-Production Editing

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows that involve dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by the complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.