ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
December 2, 2025
Authors: Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin
cs.AI
Abstract
Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
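The abstract names conditional flow matching with two dedicated branches but gives no formulas, so the following is only a minimal sketch of a generic flow matching objective applied to a left and a right latent. Everything in it is an assumption for illustration: the straight-line noise-to-data path, the independent time sampling per channel, and the names `cfm_pair`, `dual_branch_loss`, and `predict_velocity` are hypothetical and do not come from the paper.

```python
import random


def cfm_pair(x1, rng):
    """Build one flow matching training example for a data latent x1.

    Assumption: a straight-line path from Gaussian noise x0 to data x1,
        x_t = (1 - t) * x0 + t * x1,   target velocity u = x1 - x0.
    The paper may use a different path or schedule.
    """
    t = rng.random()                               # time in [0, 1)
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]         # noise sample
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u = [b - a for a, b in zip(x0, x1)]            # regression target
    return t, x_t, u


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def dual_branch_loss(left_latent, right_latent, predict_velocity, rng):
    """Sum the flow matching losses of a left and a right channel branch.

    predict_velocity(channel, x_t, t) is a hypothetical stand-in for a
    video-conditioned network; it is not the ViSAudio model.
    """
    loss = 0.0
    for channel, x1 in (("left", left_latent), ("right", right_latent)):
        t, x_t, u = cfm_pair(x1, rng)
        loss += mse(predict_velocity(channel, x_t, t), u)
    return loss


if __name__ == "__main__":
    rng = random.Random(0)
    # A trivial predictor that always outputs zero velocity.
    zero_model = lambda channel, x_t, t: [0.0] * len(x_t)
    print(dual_branch_loss([0.1, 0.2], [0.3, 0.4], zero_model, rng))
```

In a real system the predictor would be a neural network conditioned on video features, and the two branches would likely share information to keep the channels consistent; the sketch leaves both out to show only the shape of the objective.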