Token-based Audio Inpainting via Discrete Diffusion

Abstract

Audio inpainting refers to the task of reconstructing missing segments in corrupted audio recordings. While prior approaches, including waveform- and spectrogram-based diffusion models, have shown promising results for short gaps, their quality often degrades when gaps exceed 100 milliseconds (ms). In this work, we introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations produced by a pre-trained audio tokenizer. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio. We evaluate the method on the MusicNet dataset using both objective and perceptual metrics across gap durations up to 300 ms. We further evaluate our approach on the MTG dataset, extending the gap duration to 500 ms. Experimental results demonstrate that our method achieves competitive or superior performance compared to existing baselines, particularly for longer gaps, offering a robust solution for restoring degraded musical recordings. Audio examples of our proposed method can be found at https://iftach21.github.io/.
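To make the token-domain framing concrete, the following minimal sketch shows how a gap measured in milliseconds might translate into a span of masked token positions that a discrete diffusion model would then fill in. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the tokenizer's frame rate or codebook layout, so the 50 tokens-per-second rate, the single token stream, the `MASK_ID` value, and the function names are all hypothetical.

```python
MASK_ID = -1  # hypothetical id for the "masked" state; not taken from the paper


def gap_to_token_span(gap_start_ms: int, gap_ms: int, frame_rate_hz: int) -> tuple[int, int]:
    """Return the half-open range [first, last) of token frames overlapping the gap.

    Uses integer arithmetic: floor for the start frame, ceiling for the end frame,
    so every frame that the gap touches is included.
    """
    gap_end_ms = gap_start_ms + gap_ms
    first = (gap_start_ms * frame_rate_hz) // 1000
    last = -(-(gap_end_ms * frame_rate_hz) // 1000)  # ceiling division
    return first, last


def mask_gap(tokens: list[int], gap_start_ms: int, gap_ms: int, frame_rate_hz: int) -> list[int]:
    """Replace the tokens covered by the gap with MASK_ID, leaving the context intact."""
    first, last = gap_to_token_span(gap_start_ms, gap_ms, frame_rate_hz)
    last = min(last, len(tokens))  # clamp to the sequence length
    return [MASK_ID if first <= i < last else t for i, t in enumerate(tokens)]


# Illustrative usage: 2 s of audio at an assumed 50 tokens/s, with a 300 ms gap at 1 s.
tokens = list(range(100))
masked = mask_gap(tokens, gap_start_ms=1000, gap_ms=300, frame_rate_hz=50)
```

Under these assumed numbers, a 300 ms gap covers 15 token positions and a 500 ms gap covers 25, so the model only has to infer a short run of discrete symbols from the surrounding unmasked context.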
