EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Abstract

Vision and language generative models have grown rapidly in recent years. For video generation, various open-source models and publicly available services have been released that generate videos of high visual quality. However, these methods are often evaluated with only a few academic metrics, for example FVD or IS. We argue that it is hard to judge large conditional generative models with such simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities. We therefore propose a new framework and pipeline to comprehensively evaluate the quality of generated videos. To achieve this, we first construct a new prompt list for text-to-video generation by analyzing real-world prompt lists with the help of a large language model. We then evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-caption alignment, using around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients that align the objective metrics with users' opinions. With the proposed opinion alignment method, our final score correlates more strongly with user opinions than a simple average of the metrics does, which demonstrates the effectiveness of the proposed evaluation method.
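The opinion-alignment step can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the authors' implementation: the data is synthetic, only three hypothetical metrics are used instead of around 18, the fit is ordinary least squares, and Pearson correlation stands in for whatever correlation measure the paper reports.

```python
import random

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_coefficients(metrics, opinions):
    """Least-squares weights (plus a trailing bias) via the normal equations."""
    rows = [list(x) + [1.0] for x in metrics]  # append a constant bias column
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, opinions)) for i in range(k)]
    return solve_linear_system(ata, aty)

def predict(coeffs, x):
    """Weighted sum of the metrics plus the bias stored in coeffs[-1]."""
    return sum(c * v for c, v in zip(coeffs, x)) + coeffs[-1]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Synthetic stand-in data (assumption): 200 videos, 3 hypothetical metrics.
# User opinions depend unevenly on the metrics, plus some noise.
rng = random.Random(0)
metrics = [[rng.random() for _ in range(3)] for _ in range(200)]
opinions = [0.7 * a + 0.2 * b + 0.0 * c + rng.gauss(0, 0.05)
            for a, b, c in metrics]

coeffs = fit_coefficients(metrics, opinions)
fitted_scores = [predict(coeffs, x) for x in metrics]
averaged_scores = [sum(x) / len(x) for x in metrics]

corr_fit = pearson(fitted_scores, opinions)
corr_avg = pearson(averaged_scores, opinions)
print(f"fitted: {corr_fit:.3f}  simple average: {corr_avg:.3f}")
```

On the data used for fitting, a least-squares combination can never correlate worse with the opinions than an equal-weight average, because the average is itself one of the linear combinations the fit searches over. A fair comparison would therefore also check the fitted coefficients on held-out videos.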
