CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying the subject, action, and camera view, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, limited to isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench uses a multimodal large language model (MLLM) as a judge of instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a major challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.