ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Abstract

Despite progress in video-to-audio generation, the field has focused predominantly on mono output, which lacks spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then spatializes it, which often leads to error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, constructed through a semi-automated pipeline and comprising approximately 97K video-binaural audio pairs that span diverse real-world scenes and camera rotation trajectories. Furthermore, we propose ViSAudio, an end-to-end framework that combines conditional flow matching with a dual-branch audio generation architecture, in which two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, the framework balances consistency between channels while preserving their distinctive spatial characteristics, ensuring precise spatio-temporal alignment between the audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods in both objective metrics and subjective evaluations, generating high-quality, spatially immersive binaural audio that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
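
The abstract names conditional flow matching over two audio latent streams but does not give the training objective. As a rough illustration only, the sketch below shows how a flow matching training pair is commonly built, assuming a straight-line path between Gaussian noise and the target latent. The function names, latent shapes, and the choice of a linear path are assumptions made for this example and are not taken from ViSAudio; the video conditioning and the network itself are omitted.

```python
import numpy as np


def cfm_training_pair(target_latent, rng, t=None):
    """Build one flow matching training pair for a target latent.

    Assumes a linear path x_t = (1 - t) * x_0 + t * x_1 with x_0 ~ N(0, I),
    so the regression target for the velocity field is x_1 - x_0.
    """
    noise = rng.standard_normal(target_latent.shape)  # x_0
    if t is None:
        t = rng.uniform()  # time sampled uniformly in [0, 1)
    x_t = (1.0 - t) * noise + t * target_latent
    target_velocity = target_latent - noise
    return x_t, t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))


# Hypothetical latents for the left and right channels (shape is arbitrary).
rng = np.random.default_rng(0)
left_latent = rng.standard_normal((4, 8))
right_latent = rng.standard_normal((4, 8))

# Each branch gets its own noisy input and velocity target; a shared time
# value keeps the two channels at the same point along the path.
t_shared = 0.3
left_xt, _, left_v = cfm_training_pair(left_latent, rng, t=t_shared)
right_xt, _, right_v = cfm_training_pair(right_latent, rng, t=t_shared)
```

In a real model, a network conditioned on video features would predict the velocity from the noisy latent and the time value, and `cfm_loss` would be summed over the two branches.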