

Towards A Better Metric for Text-to-Video Generation

January 15, 2024
Authors: Jay Zhangjie Wu, Guian Fang, Haoning Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, Yuchao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
cs.AI

Abstract

Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. For video generation, contemporary text-to-video models exhibit impressive capabilities, crafting visually stunning videos. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, rendering them unreliable indicators of true video quality. Furthermore, while user studies have the potential to reflect human perception accurately, they are hampered by their time-intensive and laborious nature, and their outcomes are often tainted by subjective bias. In this paper, we investigate the limitations inherent in existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metric and facilitate future improvements, we present the TVGE dataset, collecting human judgments of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore in offering a better metric for text-to-video generation.
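The abstract does not give T2VScore's exact formulation, but the two-criteria design it describes (a text-video alignment score plus a video quality score) can be sketched in miniature. The snippet below is a toy illustration only, assuming CLIP-style embeddings for the text prompt and individual frames, a frame-averaged cosine similarity as the alignment proxy, and a hypothetical weighted combination; none of these choices are taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors, in [-1, 1].
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_score(text_emb, frame_embs):
    # Toy text-video alignment proxy: average per-frame similarity
    # between the text embedding and each frame embedding.
    return float(np.mean([cosine_sim(text_emb, f) for f in frame_embs]))

def combined_score(alignment, quality, w=0.5):
    # Hypothetical weighted combination of the two criteria;
    # the actual T2VScore aggregation is not specified in the abstract.
    return w * alignment + (1 - w) * quality

# Stand-in embeddings (in practice these would come from a
# vision-language model such as CLIP).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)
frame_embs = rng.normal(size=(16, 512))  # 16 frames

a = alignment_score(text_emb, frame_embs)
q = 0.8  # placeholder quality score from a quality assessor
print(combined_score(a, q))
```

A real pipeline would replace the random embeddings with model features and the placeholder quality score with the output of a learned quality assessor; the sketch only shows how two per-criterion scores can be reduced to a single number.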