
Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior

May 16, 2025
Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas
cs.AI

Abstract

Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach for transferring the effects applied to a reference audio onto a raw audio track. It optimises the effect parameters to minimise the distance between the style embeddings of the processed audio and the reference. However, this method treats all possible configurations equally and relies solely on the embedding space, which can lead to unrealistic or biased results. We address this pitfall by introducing a Gaussian prior over the parameter space, derived from a vocal preset dataset, DiffVox. The resulting optimisation is equivalent to maximum-a-posteriori estimation. Evaluations of vocal effects transfer on the MedleyDB dataset show significant improvements across metrics compared to baselines, including a blind audio effects estimator, nearest-neighbour approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter mean squared error by up to 33% and matches the reference style better. Subjective evaluations with 16 participants confirm our method's superiority, especially in limited data regimes. This work demonstrates how incorporating prior knowledge at inference time enhances audio effects transfer, paving the way for more effective and realistic audio processing systems.
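
The idea of adding a Gaussian prior to the embedding-distance objective can be illustrated with a small toy sketch. This is not the authors' implementation: the squared Euclidean distance, the diagonal covariance, the `prior_weight` factor, the finite-difference gradient descent, and the two-parameter `toy_embed` function are all assumptions made for illustration. The toy embedding only sees the sum of the two parameters, so the embedding distance alone cannot tell which configuration is more plausible; the prior term resolves that ambiguity.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def neg_log_prior(theta: Sequence[float], mean: Sequence[float],
                  var: Sequence[float]) -> float:
    """Negative log-density of a diagonal Gaussian, up to an additive constant."""
    return 0.5 * sum((t - m) ** 2 / v for t, m, v in zip(theta, mean, var))


def map_loss(theta: Sequence[float],
             embed: Callable[[Sequence[float]], Vector],
             ref_embedding: Sequence[float],
             mean: Sequence[float],
             var: Sequence[float],
             prior_weight: float = 1.0) -> float:
    """Embedding distance plus weighted negative log-prior (a MAP-style objective)."""
    emb = embed(theta)
    dist = sum((a - b) ** 2 for a, b in zip(emb, ref_embedding))
    return dist + prior_weight * neg_log_prior(theta, mean, var)


def minimise(loss: Callable[[Vector], float], theta0: Sequence[float],
             lr: float = 0.05, steps: int = 2000, eps: float = 1e-5) -> Vector:
    """Plain gradient descent using central finite differences for the gradient."""
    theta = list(theta0)
    for _ in range(steps):
        grad = []
        for i in range(len(theta)):
            up, down = theta[:], theta[:]
            up[i] += eps
            down[i] -= eps
            grad.append((loss(up) - loss(down)) / (2 * eps))
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def toy_embed(theta: Sequence[float]) -> Vector:
    """Hypothetical stand-in for 'apply effects, then embed': sees only the sum."""
    return [theta[0] + theta[1]]


# Hypothetical values: reference embedding 5.0, prior mean (1, 2), unit variances.
REF, MEAN, VAR = [5.0], [1.0, 2.0], [1.0, 1.0]

# With the prior: a MAP-style estimate pulled towards the prior mean.
theta_map = minimise(lambda th: map_loss(th, toy_embed, REF, MEAN, VAR), [0.0, 0.0])

# Without the prior (weight 0): matches the embedding, but the split between
# the two parameters depends only on the starting point.
theta_ito = minimise(
    lambda th: map_loss(th, toy_embed, REF, MEAN, VAR, prior_weight=0.0),
    [0.0, 0.0],
)
```

In this toy setting the prior-regularised estimate settles near (1.8, 2.8), which keeps the offset between the two parameters implied by the prior mean, while the unregularised run matches the reference embedding with whatever split the initialisation happens to favour. In a real system the embedding would come from an audio encoder applied to the processed signal, and the prior mean and covariance would be estimated from a preset collection.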
