WavFlow: Audio Generation in Waveform Space

Abstract

Modern audio generation relies predominantly on latent-space compression, which adds complexity and can lose information. In this work, we challenge this paradigm with WavFlow, a framework that generates high-fidelity audio directly in raw waveform space, without intermediate representations. To overcome the inherent difficulty of modeling high-dimensional, low-energy signals, we reshape audio into 2D token grids through waveform patchification and introduce amplitude lifting to align signal scales, which enables stable optimization via direct x-prediction in flow matching. To capture complex semantic alignment and temporal synchronization, we use an automated data pipeline to curate 5 million high-quality video-text-audio triplets, allowing the model to learn fine-grained acoustic patterns from scratch. Experimental results show that WavFlow achieves competitive performance on the video-to-audio benchmark VGGSound (FD_PaSST: 59.98, IS_PANNs: 17.40, DeSync: 0.44) and on the text-to-audio benchmark AudioCaps (FD_PANNs: 10.63, IS_PANNs: 12.62), matching or exceeding established latent-based methods. Our work demonstrates that intermediate compression is not a prerequisite for high-quality synthesis, and it offers a simpler and more scalable alternative for multimodal audio generation.
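The three ingredients named above (reshaping a waveform into a 2D token grid, rescaling its amplitude, and predicting the clean signal x under flow matching) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no patch length, grid shape, lifting function, or interpolation path, so the function names, the zero-padding, the constant gain used for "lifting", and the linear path x_t = t*x + (1 - t)*noise are all assumptions made for illustration.

```python
import numpy as np


def patchify(wave, patch_len, grid_w):
    """Reshape a 1D waveform into a (grid_h, grid_w, patch_len) token grid.

    Assumption: the tail is zero-padded so the grid is completely filled.
    """
    n_tokens = -(-len(wave) // patch_len)   # ceil division
    grid_h = -(-n_tokens // grid_w)         # ceil division
    padded = np.zeros(grid_h * grid_w * patch_len, dtype=wave.dtype)
    padded[: len(wave)] = wave
    return padded.reshape(grid_h, grid_w, patch_len)


def unpatchify(grid, n_samples):
    """Invert patchify and drop the zero padding."""
    return grid.reshape(-1)[:n_samples]


def lift(wave, gain):
    """Hypothetical amplitude lifting: a constant gain that brings a
    low-energy waveform closer to the scale of unit-variance noise."""
    return wave * gain


def interpolate(x, noise, t):
    """Assumed linear flow-matching path: pure noise at t=0, data at t=1."""
    return t * x + (1.0 - t) * noise


def velocity_from_x(x_pred, x_t, t):
    """Convert an x-prediction into the velocity of the linear path.

    Since x_t = t*x + (1-t)*noise, the target velocity x - noise
    equals (x - x_t) / (1 - t) for t < 1.
    """
    return (x_pred - x_t) / (1.0 - t)


def x_prediction_loss(x_pred, x):
    """Mean squared error between predicted and true clean signal."""
    return float(np.mean((x_pred - x) ** 2))
```

In this sketch a network would take the patchified `x_t` as input and output `x_pred` of the same shape. Training would minimize `x_prediction_loss`, and sampling would integrate `velocity_from_x` from t=0 towards t=1, then divide by the gain and call `unpatchify` to recover a waveform.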