Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
September 30, 2025
Authors: Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani
cs.AI
Abstract
Recent advances in video generation have enabled high-fidelity video
synthesis from user-provided prompts. However, existing models and benchmarks
fail to capture the complexity and requirements of professional video
generation. Toward that goal, we introduce Stable Cinemetrics (SCINE), a structured
evaluation framework that formalizes filmmaking controls into four
disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera.
Together, these taxonomies define 76 fine-grained control nodes grounded in
industry practices. Using these taxonomies, we construct a benchmark of prompts
aligned with professional use cases and develop an automated pipeline for
prompt categorization and question generation, enabling independent evaluation
of each control dimension. We conduct a large-scale human study spanning 10+
models and 20K videos, annotated by a pool of 80+ film professionals. Our
analyses, both coarse- and fine-grained, reveal that even the strongest current
models exhibit significant gaps, particularly in Events and Camera-related
controls. To enable scalable evaluation, we train an automatic evaluator, a
vision-language model aligned with expert annotations that outperforms existing
zero-shot baselines. SCINE is the first approach to situate professional video
generation within the landscape of video generative models, introducing
taxonomies centered around cinematic controls and supporting them with
structured evaluation pipelines and detailed analyses to guide future research.