LongAV-Compass: Towards Unified Evaluation of Minute-Long Audio-Visual Generation Across T2AV, I2AV, and V2AV

Abstract

Audio-visual generation is rapidly advancing from short clips to minute-long content, yet evaluation protocols remain largely confined to short-form settings. Existing benchmarks focus primarily on 5-10 second text-conditioned generation and rarely support unified evaluation across text, image, and video conditioning modalities. Moreover, they provide limited insight into how identity consistency, narrative coherence, and audio-visual alignment degrade over extended temporal horizons. To bridge this gap, we introduce LongAV-Compass, a systematic benchmark for minute-long audio-visual generation. LongAV-Compass contains 284 curated test cases spanning text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), organized by application scenario and generation complexity. It combines taxonomy-guided construction with a unified evaluation framework that integrates MLLM-assisted assessment with complementary perceptual and multimodal metrics, including DINO-v2, ArcFace, CLIP, and ImageBind. The framework evaluates more than 20 fine-grained dimensions covering within-segment quality, cross-segment consistency, global narrative coherence, semantic alignment, and audio-visual synchronization. Through experiments on 11 representative models together with human-alignment validation, LongAV-Compass provides a diagnostic testbed for analyzing the limitations of current systems in sustaining coherent, semantically aligned, and temporally consistent minute-scale audio-visual generation across diverse input modalities.