AudioSR: Versatile Audio Super-resolution at Scale

Abstract

Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods are limited in the audio types they can handle (e.g., music, speech) and in the bandwidth settings they support (e.g., 4kHz to 8kHz). In this paper, we introduce AudioSR, a diffusion-based generative model capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz bandwidth with a sampling rate of 48kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong results achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, Fastspeech2, and MusicGen. Our code and demo are available at https://audioldm.github.io/audiosr.
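
The bandwidth figures can be made concrete with a little arithmetic: a signal sampled at 48kHz can represent frequencies up to 24kHz (the Nyquist limit), so an input with 4kHz of bandwidth leaves the band from 4kHz to 24kHz to be predicted. The sketch below only performs this arithmetic. The function and constant names are hypothetical, chosen for illustration, and are not part of the AudioSR code.

```python
# Illustrative arithmetic only; all names here are hypothetical and do not
# come from the AudioSR codebase.

TARGET_SAMPLE_RATE_HZ = 48_000   # output sampling rate stated in the abstract
MIN_INPUT_BANDWIDTH_HZ = 2_000   # lower end of the supported input bandwidth
MAX_INPUT_BANDWIDTH_HZ = 16_000  # upper end of the supported input bandwidth


def nyquist_bandwidth_hz(sample_rate_hz: float) -> float:
    """Return the highest frequency representable at the given sampling rate."""
    return sample_rate_hz / 2.0


def missing_band_hz(input_bandwidth_hz: float) -> tuple[float, float]:
    """Return the (low, high) frequency band that super-resolution must predict.

    Raises ValueError if the input bandwidth lies outside the 2kHz to 16kHz
    range stated in the abstract.
    """
    if not MIN_INPUT_BANDWIDTH_HZ <= input_bandwidth_hz <= MAX_INPUT_BANDWIDTH_HZ:
        raise ValueError(
            f"input bandwidth {input_bandwidth_hz} Hz is outside the "
            f"{MIN_INPUT_BANDWIDTH_HZ} to {MAX_INPUT_BANDWIDTH_HZ} Hz range"
        )
    return (input_bandwidth_hz, nyquist_bandwidth_hz(TARGET_SAMPLE_RATE_HZ))


if __name__ == "__main__":
    low, high = missing_band_hz(4_000)
    print(f"band to predict: {low} Hz to {high} Hz")
```

For a 4kHz-bandwidth input, `missing_band_hz(4_000)` returns `(4000, 24000.0)`, i.e. 20kHz of spectrum that is absent from the input and must be generated.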
