CutVerse: A Compositional GUI Agent Benchmark for Media Post-Production Editing

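Abstract

GUI agents have made notable progress in web navigation and basic OS operations, but their capabilities in professional creative workflows have not yet been sufficiently explored. To close this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks based on real editing workflows. These tasks involve dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that converts raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. In extensive evaluations, existing agents achieve a task success rate of only 36.0% on realistic media editing tasks, which highlights the challenges posed by the complex, long-horizon media post-production workflows in our benchmark. Current models show promising spatial grounding, multimodal alignment, and coordinated action execution, but they remain limited in long-horizon reliability and domain-specific planning.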

English

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce CutVerse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows. These tasks involve dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark. While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.