WavFlow: Audio Generation in Waveform Space
May 18, 2026
Authors: Feiyan Zhou, Luyuan Wang, Shoufa Chen, Zhe Wang, Zhiheng Liu, Yuren Cong, Xiaohui Zhang, Fanny Yang, Belinda Zeng
cs.AI
Abstract
Modern audio generation predominantly relies on latent-space compression, introducing additional complexity and potential information loss. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space without intermediate representations. To overcome the inherent difficulties of modeling high-dimensional and low-energy signals, we reshape audio into 2D token grids through waveform patchify and introduce amplitude lifting to align signal scales, enabling stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we leverage an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding the performance of established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal audio generation.
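The abstract names three ingredients (waveform patchify, amplitude lifting, and direct x-prediction in flow matching) without giving their exact form. The following is a minimal sketch of how they could fit together, not the authors' implementation. Everything concrete in it is an assumption made for illustration: the function names, the patch size of 256, the constant gain of 10 standing in for amplitude lifting, the linear noise-to-data path, and the use of one row per patch as the token grid.

```python
import numpy as np


def patchify(wave, patch_size):
    """Reshape a 1D waveform into a (num_patches, patch_size) grid.

    The tail is zero-padded so the length divides evenly (an assumption).
    """
    pad = (-len(wave)) % patch_size
    padded = np.pad(wave, (0, pad))
    return padded.reshape(-1, patch_size)


def unpatchify(grid, length):
    """Invert patchify and drop the zero padding."""
    return grid.reshape(-1)[:length]


def lift_amplitude(x, gain):
    """Hypothetical amplitude lifting: a constant gain that brings a
    low-energy signal closer to the unit scale of Gaussian noise."""
    return x * gain


def x_prediction_loss(predict_x, x, rng):
    """Flow-matching loss where the model predicts the clean sample x.

    Assumes the linear path x_t = (1 - t) * noise + t * x with t in [0, 1).
    predict_x is any callable (x_t, t) -> estimate of x.
    """
    t = rng.uniform()
    noise = rng.standard_normal(x.shape)
    x_t = (1.0 - t) * noise + t * x
    x_hat = predict_x(x_t, t)
    return float(np.mean((x_hat - x) ** 2))


rng = np.random.default_rng(0)

# One second of a quiet 440 Hz tone sampled at 16 kHz (illustrative input).
sample_rate = 16000
wave = 0.05 * np.sin(2 * np.pi * 440 * np.arange(sample_rate) / sample_rate)

grid = lift_amplitude(patchify(wave, patch_size=256), gain=10.0)  # shape (63, 256)

# Placeholder "model" that returns its noisy input unchanged.
identity_model = lambda x_t, t: x_t
loss = x_prediction_loss(identity_model, grid, rng)

# Undo the lifting and the patching to get back to a waveform.
restored = unpatchify(grid / 10.0, len(wave))
```

Under the linear path assumed here, the velocity target is `x - noise`, and it can be recovered from a clean-sample prediction as `(x_hat - x_t) / (1 - t)`, so predicting x does not change the sampling procedure, only the quantity the network regresses.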