VABench: A Comprehensive Benchmark for Audio-Video Generation

Abstract

Recent advances in video generation enable models to produce visually compelling videos with synchronized audio. Existing video generation benchmarks provide comprehensive metrics for visual quality, but they lack convincing evaluations of audio-video generation, especially for models that generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive, multi-dimensional benchmark framework designed to systematically evaluate synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. VABench also covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.

