CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks are largely limited to isolated edits and coarse global metrics, so they fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.