Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion

Abstract

Zero-shot editing of signals with large pre-trained models has recently advanced rapidly in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, both of which use DDPM inversion on pre-trained diffusion models. The first, adopted from the image domain, allows text-based editing. The second is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody. Samples can be found on our examples page at https://hilamanor.github.io/AudioEditing/ and code can be found at https://github.com/hilamanor/AudioEditing/.
