ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
June 26, 2024
Authors: Shenghai Yuan, Jinfa Huang, Yongqi Xu, Yaoyang Liu, Shaofeng Zhang, Yujun Shi, Ruijie Zhu, Xinhua Cheng, Jiebo Luo, Li Yuan
cs.AI
Abstract
We propose a novel text-to-video (T2V) generation benchmark,
ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of
T2V models (e.g., Sora and Lumiere) in time-lapse video generation. In contrast
to existing benchmarks that focus on the visual quality and textual relevance
of generated videos, ChronoMagic-Bench focuses on the model's ability to
generate time-lapse videos with significant metamorphic amplitude and temporal
coherence. The benchmark probes T2V models for their physics, biology, and
chemistry capabilities through free-form text queries. To this end,
ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references,
categorized into four major types of time-lapse videos: biological,
human-created, meteorological, and physical phenomena, which are further
divided into 75 subcategories. This categorization comprehensively evaluates
the model's capacity to handle diverse and complex transformations. To
align the benchmark closely with human preference, we introduce two new
automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic
attributes and temporal coherence. MTScore measures the metamorphic amplitude,
reflecting the degree of change over time, while CHScore assesses the temporal
coherence, ensuring the generated videos maintain logical progression and
continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual
evaluations of ten representative T2V models, revealing their strengths and
weaknesses across different categories of prompts, and providing a thorough
evaluation framework that addresses current gaps in video generation research.
Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k
high-quality pairs of 720p time-lapse videos and detailed captions ensuring
high physical pertinence and large metamorphic amplitude.