Towards A Better Metric for Text-to-Video Generation
January 15, 2024
Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
cs.AI
Abstract
Generative models have demonstrated remarkable capability in synthesizing
high-quality text, images, and videos. For video generation, contemporary
text-to-video models exhibit impressive capabilities, crafting visually
stunning videos. Nonetheless, evaluating such videos poses significant
challenges. Current research predominantly employs automated metrics such as
FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis,
particularly in the temporal assessment of video content, thus rendering them
unreliable indicators of true video quality. Furthermore, while user studies
have the potential to reflect human perception accurately, they are hampered by
their time-intensive and laborious nature, with outcomes that are often tainted
by subjective bias. In this paper, we investigate the limitations inherent in
existing metrics and introduce a novel evaluation pipeline, the Text-to-Video
Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video
Alignment, which scrutinizes the fidelity of the video in representing the
given text description, and (2) Video Quality, which evaluates the video's
overall production caliber with a mixture of experts. Moreover, to evaluate the
proposed metrics and facilitate future improvements to them, we present the
TVGE dataset, collecting human judgments on the two criteria for 2,543
text-to-video generated videos. Experiments on the TVGE dataset demonstrate
the superiority of the proposed T2VScore in offering a better metric for
text-to-video generation.
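
For intuition, the sketch below illustrates the kind of frame-wise CLIP-based alignment scoring whose limitations the abstract points out, folded into a simple weighted combination with a quality score. Everything here is illustrative: `naive_alignment_score`, `toy_t2v_score`, and the weight `alpha` are assumptions made for exposition, not the actual T2VScore pipeline, which additionally models temporal dynamics and aggregates a mixture of quality experts.

```python
# A minimal, illustrative sketch (NOT the authors' T2VScore implementation):
# per-frame CLIP text-image similarity, averaged over sampled frames, then
# combined with a quality score via a hypothetical weight `alpha`.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def naive_alignment_score(prompt: str, frames: list[Image.Image]) -> float:
    """Average per-frame CLIP cosine similarity between prompt and frames.

    Because each frame is scored independently, this ignores temporal
    consistency -- the blind spot of CLIP-Score-style metrics that the
    paper highlights.
    """
    inputs = processor(text=[prompt], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # text_embeds: (1, d), image_embeds: (num_frames, d); both L2-normalized,
    # so the dot product below is a per-frame cosine similarity.
    sims = out.image_embeds @ out.text_embeds.T
    return sims.mean().item()

def toy_t2v_score(alignment: float, quality: float, alpha: float = 0.5) -> float:
    """Hypothetical combination of the two criteria; T2VScore's actual
    aggregation is defined in the paper, not here."""
    return alpha * alignment + (1.0 - alpha) * quality
```

In practice, a caller would sample a handful of frames from the generated video (e.g., with OpenCV or decord), pass them with the prompt to `naive_alignment_score`, and supply a separate quality estimate; the point of the sketch is that any such frame-independent scheme cannot detect temporal artifacts such as flicker or motion inconsistency.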