T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
July 19, 2024
Authors: Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu
cs.AI
Abstract
Text-to-video (T2V) generation models have advanced significantly, yet their
ability to compose different objects, attributes, actions, and motions into a
video remains unexplored. Previous text-to-video benchmarks also neglect to
evaluate this important ability. In this work, we conduct the first systematic
study on compositional text-to-video generation. We propose T2V-CompBench, the
first benchmark tailored for compositional text-to-video generation.
T2V-CompBench encompasses diverse aspects of compositionality, including
consistent attribute binding, dynamic attribute binding, spatial relationships,
motion binding, action binding, object interactions, and generative numeracy.
We further carefully design MLLM-based, detection-based, and tracking-based
evaluation metrics, which better reflect compositional text-to-video
generation quality across the seven proposed categories with 700 text
prompts. The effectiveness of the proposed metrics is verified by
correlation with human evaluations. We also benchmark various text-to-video
generative models and conduct in-depth analysis across different models and
different compositional categories. We find that compositional text-to-video
generation is highly challenging for current models, and we hope that our
attempt will shed light on future research in this direction.