Towards A Better Metric for Text-to-Video Generation

Abstract

Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos. In video generation, contemporary text-to-video models produce visually stunning results. Nonetheless, evaluating such videos poses significant challenges. Current research predominantly relies on automated metrics such as FVD, IS, and CLIP Score. However, these metrics provide an incomplete analysis, particularly of the temporal aspects of video content, which makes them unreliable indicators of true video quality. User studies, for their part, have the potential to reflect human perception accurately, but they are time-consuming and laborious, and their outcomes are often tainted by subjective bias. In this paper, we investigate the limitations of existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore). This metric integrates two pivotal criteria: (1) Text-Video Alignment, which examines how faithfully the video represents the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts. Moreover, to evaluate the proposed metric and facilitate future improvements to it, we present the TVGE dataset, which collects human judgements of 2,543 text-to-video generated videos on the two criteria. Experiments on the TVGE dataset show that the proposed T2VScore offers a better metric for text-to-video generation.
