CutVerse: A Compositional GUI Agent Benchmark for Media Post-Production Editing

Abstract

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely unexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows that involve dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on these tasks, underscoring the difficulty of complex, long-horizon media post-production workflows. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.