VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

March 27, 2025
Authors: Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, Yu Qiao, Ziwei Liu
cs.AI

Abstract

Video generation has advanced significantly, evolving from producing unrealistic outputs to generating videos that appear visually convincing and temporally coherent. To evaluate these video generative models, benchmarks such as VBench have been developed to assess their faithfulness, measuring factors like per-frame aesthetics, temporal consistency, and basic prompt adherence. However, these aspects mainly represent superficial faithfulness, which focuses on whether the video appears visually convincing rather than whether it adheres to real-world principles. While recent models perform increasingly well on these metrics, they still struggle to generate videos that are not just visually plausible but fundamentally realistic. To achieve real "world models" through video generation, the next frontier lies in intrinsic faithfulness: ensuring that generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity. Achieving this level of realism is essential for applications such as AI-assisted filmmaking and simulated world modeling. To bridge this gap, we introduce VBench-2.0, a next-generation benchmark designed to automatically evaluate video generative models for their intrinsic faithfulness. VBench-2.0 assesses five key dimensions: Human Fidelity, Controllability, Creativity, Physics, and Commonsense, each further broken down into fine-grained capabilities. Tailored to individual dimensions, our evaluation framework integrates generalists, such as state-of-the-art VLMs and LLMs, and specialists, including anomaly detection methods proposed for video generation. We conduct extensive annotations to ensure alignment with human judgment. By pushing beyond superficial faithfulness toward intrinsic faithfulness, VBench-2.0 aims to set a new standard for the next generation of video generative models.
