Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

Abstract

Recent advances in text-to-video (T2V) technology, as demonstrated by models such as Gen2, Pika, and Sora, have significantly broadened its applicability and popularity. Despite these strides, evaluating these models remains a substantial challenge. Because of the limitations inherent in automatic metrics, manual evaluation is often considered the superior method for assessing T2V generation. However, existing manual evaluation protocols suffer from reproducibility, reliability, and practicality issues. To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models. The T2VHE protocol includes well-defined metrics, thorough annotator training, and an effective dynamic evaluation module. Experimental results demonstrate that this protocol not only ensures high-quality annotations but can also reduce evaluation costs by nearly 50%. We will open-source the entire setup of the T2VHE protocol, including the complete protocol workflow, the dynamic evaluation component details, and the annotation interface code, to help the community establish more sophisticated human evaluation protocols.

