EvalCrafter: Benchmarking and Evaluating Large Video Generation Models

Abstract

Vision and language generative models have advanced rapidly in recent years. For video generation, various open-source models and publicly available services have been released that generate videos of high visual quality. However, these methods are often evaluated with only a few academic metrics, such as FVD or IS. We argue that it is hard to judge large conditional generative models with such simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities. We therefore propose a new framework and pipeline to exhaustively evaluate the performance of generated videos. To this end, we first construct a new prompt list for text-to-video generation by analyzing real-world prompt lists with the help of a large language model. We then evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-caption alignment, using around 18 objective metrics. To obtain the final leaderboard of the models, we also fit a series of coefficients that align the objective metrics with users' opinions. With the proposed opinion alignment method, our final score shows a higher correlation with user opinions than simply averaging the metrics, which demonstrates the effectiveness of the proposed evaluation method.
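
The idea of fitting coefficients that align objective metrics with user opinions can be illustrated with a minimal sketch. It assumes ordinary least squares as the fitting procedure and Pearson correlation as the agreement measure; the abstract does not specify either. The data is synthetic, and the names `fit_coefficients`, `predict`, `pearson`, and `true_w` are hypothetical, introduced only for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_coefficients(metrics, opinions):
    """Least-squares weights (last entry is the intercept) via normal equations."""
    rows = [r + [1.0] for r in metrics]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * y for r, y in zip(rows, opinions)) for i in range(k)]
    return solve(ata, atb)


def predict(metrics, coeffs):
    """Weighted sum of each metric row plus the intercept."""
    return [sum(w * v for w, v in zip(coeffs, r + [1.0])) for r in metrics]


# Synthetic stand-in data (NOT from the paper): three metric scores per video,
# and "user opinions" that weight the metrics unequally, plus noise.
rng = random.Random(0)
true_w = [0.7, 0.1, 0.2]  # hypothetical ground-truth weights
metrics = [[rng.random() for _ in true_w] for _ in range(200)]
opinions = [
    sum(w * v for w, v in zip(true_w, row)) + rng.gauss(0, 0.05)
    for row in metrics
]

coeffs = fit_coefficients(metrics, opinions)
fitted_scores = predict(metrics, coeffs)
averaged_scores = [sum(row) / len(row) for row in metrics]

print("fitted   :", pearson(fitted_scores, opinions))
print("averaged :", pearson(averaged_scores, opinions))
```

In this sketch the fitted score is evaluated on the same samples it was fitted to, so it cannot correlate worse than the plain average. A real study would measure correlation on held-out videos and raters.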
