VEFX-Bench: A Comprehensive Benchmark for General-Purpose Video Editing and Visual Effects

Abstract

As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model (VLM) judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard image and video quality assessment (IQA/VQA) metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing persistent gaps in visual plausibility, instruction following, and edit locality in current models.
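
To make the phrase "per-dimension quality scores via ordinal regression" concrete, the following is a minimal sketch of one common way to decode ordinal outputs, using cumulative threshold logits. It is not the VEFX-Reward implementation: the abstract does not specify the formulation, so the 1-to-4 rating scale, the threshold-logit layout, and the names `decode_ordinal` and `score_edit` are all assumptions made for illustration.

```python
import math

# The three dimension names come from the abstract; the snake_case keys are illustrative.
DIMENSIONS = ("instruction_following", "rendering_quality", "edit_exclusivity")


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_ordinal(logits, min_score=1):
    """Decode cumulative-threshold logits into an ordinal score.

    Assumption: logits[k] is the logit of P(score > min_score + k), so a
    scale with K levels needs K - 1 logits (3 logits for a 1-to-4 scale).
    Returns (hard_score, expected_score).
    """
    probs = [sigmoid(z) for z in logits]
    # Hard label: lowest score plus the number of thresholds judged exceeded.
    hard = min_score + sum(p > 0.5 for p in probs)
    # Soft label: expected score under the cumulative probabilities.
    expected = min_score + sum(probs)
    return hard, expected


def score_edit(per_dim_logits):
    """Decode one (hard, expected) score pair per quality dimension."""
    return {dim: decode_ordinal(per_dim_logits[dim]) for dim in DIMENSIONS}


if __name__ == "__main__":
    # Hypothetical logits that a model head might emit for one edited video.
    example = {
        "instruction_following": [4.0, 2.0, -1.0],
        "rendering_quality": [3.0, 0.5, -2.0],
        "edit_exclusivity": [-1.0, -3.0, -5.0],
    }
    for dim, (hard, expected) in score_edit(example).items():
        print(f"{dim}: hard={hard}, expected={expected:.2f}")
```

Treating each dimension as an ordered scale rather than as unordered classes keeps the model's errors graded: predicting 3 for a true 4 is penalized less than predicting 1. The soft expected score also gives a continuous value that can be correlated with human ratings.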