ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

December 2, 2025
Authors: Mengchen Zhang, Qi Chen, Tong Wu, Zihan Liu, Dahua Lin
cs.AI

Abstract

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, lacking spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods across both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
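
The abstract names conditional flow matching with two dedicated branches but does not give the formulation. As a rough illustration only, the sketch below assumes the common straight-line (linear-path) flow matching objective and applies it independently to a left-channel and a right-channel latent that share one video condition. The function names, the per-channel branch callables, and the shared time step are hypothetical choices made for this example. They are not taken from the ViSAudio paper or its code.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path x_t = (1 - t) * x0 + t * x1; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def dual_branch_cfm_loss(left_latent, right_latent, cond,
                         left_branch, right_branch, rng):
    """Sum of per-channel flow matching losses (illustrative assumption).

    Each branch is a callable (x_t, t, cond) -> predicted velocity.
    Both channels share the same time step t and the same condition,
    but each draws its own Gaussian noise sample.
    """
    t = rng.random()
    total = 0.0
    for x1, branch in ((left_latent, left_branch), (right_latent, right_branch)):
        x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
        xt = interpolate(x0, x1, t)               # noisy latent at time t
        pred = branch(xt, t, cond)                # branch's velocity estimate
        total += mse(pred, target_velocity(x0, x1))
    return total


if __name__ == "__main__":
    rng = random.Random(0)
    left = [0.5, -1.0, 2.0]      # toy left-channel latent
    right = [0.4, -0.8, 1.5]     # toy right-channel latent
    video_cond = [1.0, 0.0]      # placeholder for video features

    # Untrained stand-in branch that always predicts zero velocity.
    zero_branch = lambda xt, t, cond: [0.0] * len(xt)
    print(dual_branch_cfm_loss(left, right, video_cond,
                               zero_branch, zero_branch, rng))
```

In a real system each branch would be a neural network trained to minimise this loss, and generation would integrate the learned velocity field from noise to a latent. The sketch leaves out any coupling between the two branches, which the abstract attributes to the conditional spacetime module.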