Token-based Audio Inpainting via Discrete Diffusion
July 11, 2025
Authors: Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani
cs.AI
Abstract
Audio inpainting refers to the task of reconstructing missing segments in
corrupted audio recordings. While prior approaches, including waveform- and
spectrogram-based diffusion models, have shown promising results for short
gaps, they often degrade in quality when gaps exceed 100 milliseconds (ms).
In this work, we introduce a novel inpainting method based on discrete
diffusion modeling, which operates over tokenized audio representations
produced by a pre-trained audio tokenizer. Our approach models the generative
process directly in the discrete latent space, enabling stable and
semantically coherent reconstruction of missing audio. We evaluate the method
on the MusicNet dataset using both objective and perceptual metrics across
gap durations of up to 300 ms. We further evaluate our approach on the MTG
dataset, extending the gap duration to 500 ms. Experimental results
demonstrate that our method achieves competitive or superior performance
compared to existing baselines, particularly for longer gaps, offering a
robust solution for restoring degraded musical recordings. Audio examples of
our proposed method can be found at https://iftach21.github.io/
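
To make the core idea concrete, the sketch below illustrates one common way discrete diffusion inpainting over tokens can work: the corrupted gap is represented by an absorbing-state MASK token, and a denoiser iteratively replaces masked positions, committing the most confident predictions first. This is a minimal toy illustration, not the paper's implementation: the codebook size, the confidence-based unmasking schedule, and the random stand-in "denoiser" are all assumptions made purely for demonstration.

```python
# Toy sketch of token-level audio inpainting with absorbing-state
# discrete diffusion. A real system would use a pre-trained audio
# tokenizer and a learned denoiser; here a hypothetical random
# "denoiser" stands in, purely to show the iterative unmasking loop.
import numpy as np

VOCAB = 256      # assumed codebook size of the audio tokenizer
MASK = VOCAB     # extra absorbing-state token id marking the gap

def toy_denoiser(tokens, rng):
    """Stand-in for a learned model: per-position logits over the codebook."""
    return rng.standard_normal((len(tokens), VOCAB))

def inpaint(tokens, steps=8, seed=0):
    """Iteratively replace MASK tokens, unmasking the most confident
    positions first (a common discrete-diffusion sampling schedule)."""
    rng = np.random.default_rng(seed)
    tokens = tokens.copy()
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        logits = toy_denoiser(tokens, rng)
        conf = logits.max(axis=1)
        # unmask a growing fraction of the remaining gap each step,
        # so every masked position is filled by the final step
        k = max(1, masked.size // (steps - step))
        pick = masked[np.argsort(-conf[masked])[:k]]
        tokens[pick] = logits[pick].argmax(axis=1)
    return tokens

seq = np.arange(32) % VOCAB        # surrounding context tokens
seq[12:20] = MASK                  # simulate a corrupted gap
out = inpaint(seq)
assert not np.any(out == MASK)     # gap fully filled with codebook tokens
```

Only masked positions are ever rewritten, so the intact context around the gap is preserved exactly; in the full method the learned denoiser conditions on that context, which is what enables semantically coherent reconstructions over long gaps.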