EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Abstract

Vision and language generative models have advanced rapidly in recent years. For video generation, various open-source models and publicly available services have been released that generate videos of high visual quality. However, these methods are often evaluated with only a few academic metrics, such as FVD or IS. We argue that it is hard to judge large conditional generative models with such simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities. We therefore propose a new framework and pipeline to exhaustively evaluate the performance of generated videos. To achieve this, we first build a new prompt list for text-to-video generation by analyzing real-world prompt lists with the help of a large language model. We then evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-caption alignment, using around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients that align the objective metrics with users' opinions. With the proposed opinion alignment method, our final score correlates more strongly with user opinions than a simple average of the metrics, which demonstrates the effectiveness of the proposed evaluation method.
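
The opinion alignment idea in the abstract can be illustrated with a minimal sketch. It assumes an ordinary least-squares fit, three made-up metrics, and synthetic user scores; none of these are taken from the paper, whose actual metrics and fitting procedure may differ. The helper names (`fit_coefficients`, `aligned_score`) are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_coefficients(metric_rows, opinions):
    """Least-squares weights plus a trailing bias term, via the normal equations."""
    rows = [list(r) + [1.0] for r in metric_rows]  # append a constant column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, opinions)) for i in range(k)]
    return solve(xtx, xty)


def aligned_score(row, coeffs):
    """Weighted sum of the metrics plus the bias (last coefficient)."""
    return sum(w * x for w, x in zip(coeffs, row)) + coeffs[-1]


# Synthetic stand-in data: an assumption for illustration, not real results.
rng = random.Random(0)
TRUE_WEIGHTS = (0.8, 0.1, -0.3)  # hypothetical; the third metric hurts opinion
metric_rows = [[rng.random() for _ in TRUE_WEIGHTS] for _ in range(200)]
opinions = [
    sum(w * x for w, x in zip(TRUE_WEIGHTS, row)) + rng.gauss(0, 0.05)
    for row in metric_rows
]

coeffs = fit_coefficients(metric_rows, opinions)
fitted = [aligned_score(row, coeffs) for row in metric_rows]
averaged = [sum(row) / len(row) for row in metric_rows]

corr_fitted = pearson(fitted, opinions)
corr_avg = pearson(averaged, opinions)
print("fitted coefficients:", [round(c, 3) for c in coeffs])
print("correlation, fitted score:", round(corr_fitted, 3))
print("correlation, simple average:", round(corr_avg, 3))
```

On the data it was fitted to, a least-squares score can never correlate worse with the opinions than an unweighted average, so a meaningful comparison would also check the correlation on held-out samples.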
