Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
September 30, 2025
Authors: Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani
cs.AI
Abstract
Recent advances in video generation have enabled high-fidelity video
synthesis from user-provided prompts. However, existing models and benchmarks
fail to capture the complexity and requirements of professional video
generation. To address this gap, we introduce Stable Cinemetrics (SCINE), a structured
evaluation framework that formalizes filmmaking controls into four
disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera.
Together, these taxonomies define 76 fine-grained control nodes grounded in
industry practices. Using these taxonomies, we construct a benchmark of prompts
aligned with professional use cases and develop an automated pipeline for
prompt categorization and question generation, enabling independent evaluation
of each control dimension. We conduct a large-scale human study spanning 10+
models and 20K videos, annotated by a pool of 80+ film professionals. Our
analyses, both coarse- and fine-grained, reveal that even the strongest current
models exhibit significant gaps, particularly in Event- and Camera-related
controls. To enable scalable evaluation, we train an automatic evaluator, a
vision-language model aligned with expert annotations that outperforms existing
zero-shot baselines. SCINE is the first approach to situate professional video
generation within the landscape of video generative models, introducing
taxonomies centered around cinematic controls and supporting them with
structured evaluation pipelines and detailed analyses to guide future research.
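
As a rough illustration of how hierarchical control taxonomies can support independent, per-dimension evaluation, the following minimal sketch models each taxonomy as a tree, turns tagged control paths into questions, and averages ratings per top-level taxonomy. Only the four taxonomy names (Setup, Event, Lighting, Camera) come from the abstract. The child node names, the question wording, the rating scale, and the helper names `ControlNode`, `questions_for`, and `per_taxonomy_mean` are hypothetical placeholders, not the authors' actual nodes or pipeline.

```python
from dataclasses import dataclass, field


@dataclass
class ControlNode:
    """One node in a hierarchical taxonomy of filmmaking controls."""
    name: str
    children: list["ControlNode"] = field(default_factory=list)

    def leaves(self, prefix=()):
        """Yield the path (tuple of names) to every leaf under this node."""
        path = prefix + (self.name,)
        if not self.children:
            yield path
        for child in self.children:
            yield from child.leaves(path)


# Top-level names are from the abstract; all child names are hypothetical.
TAXONOMIES = [
    ControlNode("Setup", [ControlNode("Subject"), ControlNode("Location")]),
    ControlNode("Event", [ControlNode("Action")]),
    ControlNode("Lighting", [ControlNode("Source"), ControlNode("Color")]),
    ControlNode("Camera", [
        ControlNode("Motion", [ControlNode("Pan"), ControlNode("Dolly")]),
    ]),
]


def questions_for(tagged_paths):
    """Map each tagged control path to one standalone evaluation question."""
    return {
        path: f"Does the video follow the prompt's {' > '.join(path)} specification?"
        for path in tagged_paths
    }


def per_taxonomy_mean(ratings):
    """Average (path, score) ratings per top-level taxonomy."""
    buckets = {}
    for path, score in ratings:
        buckets.setdefault(path[0], []).append(score)
    return {name: sum(scores) / len(scores) for name, scores in buckets.items()}
```

Because every question is keyed by a single control path, a weak score on one dimension (for example a camera move) does not contaminate the score for another (for example lighting), which is the kind of disentangled reporting the abstract describes.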