ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation

Abstract

Despite progress in video-to-audio generation, the field focuses predominantly on mono output, which lacks spatial immersion. Existing binaural approaches remain constrained by a two-stage pipeline that first generates mono audio and then performs spatialization, often resulting in error accumulation and spatio-temporal inconsistencies. To address this limitation, we introduce the task of end-to-end binaural spatial audio generation directly from silent video. To support this task, we present the BiAudio dataset, comprising approximately 97K video-binaural audio pairs spanning diverse real-world scenes and camera rotation trajectories, constructed through a semi-automated pipeline. Furthermore, we propose ViSAudio, an end-to-end framework that employs conditional flow matching with a dual-branch audio generation architecture, where two dedicated branches model the audio latent flows. Integrated with a conditional spacetime module, it balances consistency between channels while preserving distinctive spatial characteristics, ensuring precise spatio-temporal alignment between the audio and the input video. Comprehensive experiments demonstrate that ViSAudio outperforms existing state-of-the-art methods on both objective metrics and subjective evaluations, generating high-quality binaural audio with spatial immersion that adapts effectively to viewpoint changes, sound-source motion, and diverse acoustic environments. Project website: https://kszpxxzmc.github.io/ViSAudio-project.
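The abstract names conditional flow matching with two dedicated branches but gives no formulas, so the following is only a rough sketch of what such an objective could look like. It assumes the common straight-line (rectified-flow style) probability path between Gaussian noise and a data latent, and it assumes the two branches are trained by simply summing one loss per channel. The function names, the `cond` argument standing in for video features, and the summed loss are all hypothetical illustrations, not the authors' implementation; the conditional spacetime module is not modeled here.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict, x1, cond, rng):
    """Single-sample flow matching loss for one latent vector x1.

    `predict(xt, t, cond)` is a hypothetical velocity model; `cond` stands in
    for whatever video conditioning the real system would use.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict(xt, t, cond)
    v_true = target_velocity(x0, x1)
    # Mean squared error between predicted and target velocity
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def dual_branch_loss(predict_left, predict_right, left, right, cond, rng):
    """Assumed combination: one loss per channel branch, added together."""
    return (flow_matching_loss(predict_left, left, cond, rng)
            + flow_matching_loss(predict_right, right, cond, rng))
```

In this sketch each channel latent gets its own velocity predictor while both share the same conditioning, which is one simple way to read "two dedicated branches"; how the actual framework couples the branches is not specified in the abstract.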
