CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract

While recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks, heavily constrained by isolated edits and coarse global metrics, fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: current models frequently omit edits, violate preservation constraints, or introduce artifacts when handling multiple operations simultaneously. CoVEBench provides a challenging, diagnostic testbed to advance video editing toward realistic user workflows.
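The counts above work out to roughly 16 checklist items per instruction (9,990 / 626). The abstract does not say how per-item verdicts are combined into a score, so the following is only a minimal sketch under stated assumptions: each checklist item is taken to be a binary pass/fail verdict from an MLLM judge, and items are split into two hypothetical kinds, `edit` (instruction compliance) and `preserve` (fidelity to unrelated content). The names `ChecklistItem` and `pass_rates` are illustrative and do not come from CoVEBench.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    kind: str     # hypothetical label: "edit" or "preserve"
    passed: bool  # verdict assumed to come from an MLLM judge


def pass_rates(items):
    """Return the fraction of passed checklist items for each kind."""
    totals, passed = {}, {}
    for item in items:
        totals[item.kind] = totals.get(item.kind, 0) + 1
        passed[item.kind] = passed.get(item.kind, 0) + int(item.passed)
    return {kind: passed[kind] / totals[kind] for kind in totals}


# Toy verdicts for one edited video (made-up values, not benchmark data).
items = [
    ChecklistItem("edit", True),
    ChecklistItem("edit", False),      # one requested edit was omitted
    ChecklistItem("preserve", True),
    ChecklistItem("preserve", True),
]
rates = pass_rates(items)
print(rates)
```

Reporting the two rates separately, rather than as one global number, is one way to tell an omitted edit apart from a violated preservation constraint, which is the kind of diagnosis the abstract argues coarse global metrics cannot provide.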