CutVerse: A Compositional GUI Agent Benchmark for Media Post-Production Editing

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely unexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows that involve dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by the complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.