LongAV-Compass: Toward Unified Evaluation of Minute-Scale Audio-Visual Generation across T2AV, I2AV, and V2AV

Abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, while evaluation protocols remain largely confined to short-form settings. Current benchmarks primarily focus on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. It combines taxonomy-guided benchmark construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.