CoVEBench: Can Video Editing Models Handle Complex Instructions?

Abstract

Recent text-guided video editing models excel at elementary tasks (e.g., style transfer, object insertion), but real-world user requests are highly compositional. A single prompt often demands multiple coupled edits, such as modifying subjects, actions, and camera views, while strictly preserving unrelated spatiotemporal content. Existing benchmarks are largely limited to isolated edits and coarse global metrics, so they fail to diagnose how models handle such complex workflows. To address this gap, we introduce CoVEBench, a compositional video editing benchmark comprising 416 curated source videos, 626 multi-point editing instructions, and 9,990 fine-grained checklist items. Covering diverse editing dimensions, CoVEBench evaluates models via MLLM-judged instruction compliance and video fidelity, alongside automated metrics for video quality. Extensive experiments reveal that compositional editing remains a profound challenge: when handling multiple operations simultaneously, current models frequently omit edits, violate preservation constraints, or introduce artifacts. CoVEBench provides a challenging, diagnostic testbed for advancing video editing toward realistic user workflows.