VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

Abstract

Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness: ensuring that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored for individual dimensions, our evaluation framework integrates generalists, such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.
