FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Abstract

Diffusion-based voice conversion (VC) techniques such as VoiceGrad have attracted interest because of their high VC performance in terms of speech quality and speaker similarity. However, a notable limitation is the slow inference caused by multi-step reverse diffusion. Therefore, we propose FastVoiceGrad, a novel one-step diffusion-based VC that reduces the number of iterations from dozens to one while inheriting the high VC performance of multi-step diffusion-based VC. We obtain the model using adversarial conditional diffusion distillation (ACDD), which leverages the abilities of generative adversarial networks and diffusion models while reconsidering the initial states in sampling. Evaluations of one-shot any-to-any VC demonstrate that FastVoiceGrad achieves VC performance superior or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/fastvoicegrad/.
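To make the inference-cost argument concrete, the following is a minimal sketch that contrasts multi-step reverse diffusion with a single-step prediction by counting model evaluations. It is not the FastVoiceGrad implementation. Every name in it (`make_alpha_bar`, `ToyDenoiser`, `multi_step_sample`, `one_step_sample`) is hypothetical, and the linear noise schedule, the 30-step count, the DDIM-style deterministic update, and the choice to start from a noised source vector are all assumptions made for illustration. The denoiser is an oracle that already knows the clean target, so both paths recover it exactly. A real network trained for multi-step sampling would generally degrade when called only once, and that gap is what a distillation procedure such as ACDD is meant to close.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bar, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bar.append(prod)
    return alpha_bar


class ToyDenoiser:
    """Hypothetical oracle noise predictor (not a trained network).

    It knows the clean target, so it returns the exact noise in x_t.
    It also counts how many times it is evaluated.
    """

    def __init__(self, target, alpha_bar):
        self.target = target
        self.alpha_bar = alpha_bar
        self.calls = 0

    def __call__(self, x_t, t):
        self.calls += 1
        ab = self.alpha_bar[t]
        return [(x - math.sqrt(ab) * y) / math.sqrt(1.0 - ab)
                for x, y in zip(x_t, self.target)]


def predict_x0(x_t, eps, ab):
    """Invert x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps for x0."""
    return [(x - math.sqrt(1.0 - ab) * e) / math.sqrt(ab)
            for x, e in zip(x_t, eps)]


def multi_step_sample(model, x_start, alpha_bar, t_start):
    """Deterministic DDIM-style reverse diffusion: one model call per step."""
    x = x_start
    for t in range(t_start, 0, -1):
        eps = model(x, t)
        x0_hat = predict_x0(x, eps, alpha_bar[t])
        ab_prev = alpha_bar[t - 1]
        # Re-noise the clean estimate to the next (lower) noise level.
        x = [math.sqrt(ab_prev) * c + math.sqrt(1.0 - ab_prev) * e
             for c, e in zip(x0_hat, eps)]
    return predict_x0(x, model(x, 0), alpha_bar[0])


def one_step_sample(model, x_start, alpha_bar, t_start):
    """Single model call: jump from the initial state straight to x0."""
    return predict_x0(x_start, model(x_start, t_start), alpha_bar[t_start])


# Toy data: 8-dimensional vectors stand in for mel-spectrogram frames.
rng = random.Random(0)
target = [rng.gauss(0.0, 1.0) for _ in range(8)]
source = [rng.gauss(0.0, 1.0) for _ in range(8)]

alpha_bar = make_alpha_bar(30)  # 30 steps is an illustrative choice
t_start = len(alpha_bar) - 1
ab = alpha_bar[t_start]
# Assumption: start from a noised source rather than pure noise.
x_start = [math.sqrt(ab) * s + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
           for s in source]

multi_model = ToyDenoiser(target, alpha_bar)
one_model = ToyDenoiser(target, alpha_bar)
out_multi = multi_step_sample(multi_model, x_start, alpha_bar, t_start)
out_one = one_step_sample(one_model, x_start, alpha_bar, t_start)

print("multi-step model calls:", multi_model.calls)
print("one-step model calls:", one_model.calls)
```

In this toy setting the multi-step path evaluates the model 30 times and the one-step path evaluates it once. Because the network forward pass typically dominates inference time, the call count is a rough proxy for the speedup that one-step sampling targets.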
