T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Abstract

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language. Its evaluation, however, remains fragmented: it often relies on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems. It consists of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. In addition, T2AV-Compass introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for assessing instruction following and realism. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in areas such as audio realism, fine-grained synchronization, and instruction following. These results indicate substantial room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.

