Token-based Audio Inpainting via Discrete Diffusion
July 11, 2025
Authors: Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
cs.AI
Abstract
Audio inpainting refers to the task of reconstructing missing segments in
corrupted audio recordings. While prior approaches, including waveform- and
spectrogram-based diffusion models, have shown promising results for short gaps,
they often degrade in quality when gaps exceed 100 milliseconds (ms). In this
work, we introduce a novel inpainting method based on discrete diffusion
modeling, which operates over tokenized audio representations produced by a
pre-trained audio tokenizer. Our approach models the generative process
directly in the discrete latent space, enabling stable and semantically
coherent reconstruction of missing audio. We evaluate the method on the
MusicNet dataset using both objective and perceptual metrics across gap
durations up to 300 ms. We further evaluate our approach on the MTG dataset,
extending the gap duration to 500 ms. Experimental results demonstrate that our
method achieves competitive or superior performance compared to existing
baselines, particularly for longer gaps, offering a robust solution for
restoring degraded musical recordings. Audio examples of our proposed method
can be found at https://iftach21.github.io/
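To make the idea of inpainting in a discrete token space concrete, the following is a minimal, self-contained sketch of an unmasking-style reverse process over a token sequence. It is not the paper's implementation: the mask token id, the unmasking schedule, and the trivial "denoiser" (which simply copies the nearest known token to the left) are illustrative assumptions standing in for the pre-trained tokenizer and the learned discrete diffusion model.

```python
import random

MASK = -1  # hypothetical mask-token id (assumption, not the paper's vocabulary)

def inpaint_tokens(tokens, gap_start, gap_end, denoise_step, num_steps=4, seed=0):
    """Toy discrete-diffusion-style inpainting.

    Tokens inside [gap_start, gap_end) start fully masked and are
    progressively committed over num_steps reverse steps. Context tokens
    outside the gap are conditioning and are never modified.
    """
    rng = random.Random(seed)
    seq = list(tokens)
    gap = range(gap_start, gap_end)
    for i in gap:
        seq[i] = MASK
    masked = set(gap)
    for step in range(num_steps):
        # Unmask roughly an equal share of remaining positions each step.
        k = max(1, len(masked) // (num_steps - step))
        for i in rng.sample(sorted(masked), min(k, len(masked))):
            seq[i] = denoise_step(seq, i)
            masked.discard(i)
    for i in sorted(masked):  # commit any leftovers
        seq[i] = denoise_step(seq, i)
    return seq

def copy_left_neighbor(seq, i):
    """Stand-in for the learned model: predict the nearest known token
    to the left of position i (0 if none exists)."""
    j = i - 1
    while j >= 0 and seq[j] == MASK:
        j -= 1
    return seq[j] if j >= 0 else 0
```

With a real model, `denoise_step` would be replaced by sampling from the network's predicted distribution over the codebook, conditioned on the full (partially unmasked) sequence; the surrounding loop structure, in which the gap is filled over several reverse steps while the context stays fixed, is the part this sketch illustrates.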