Improving Inference-Time Optimisation for Vocal Effects Style Transfer with a Gaussian Prior
May 16, 2025
Authors: Chin-Yun Yu, Marco A. Martínez-Ramírez, Junghyun Koo, Wei-Hsiang Liao, Yuki Mitsufuji, György Fazekas
cs.AI
Abstract
Style Transfer with Inference-Time Optimisation (ST-ITO) is a recent approach
for transferring the applied effects of a reference audio to a raw audio track.
It optimises the effect parameters to minimise the distance between the style
embeddings of the processed audio and the reference. However, this method
treats all possible configurations equally and relies solely on the embedding
space, which can lead to unrealistic or biased results. We address this pitfall
by introducing a Gaussian prior derived from a vocal preset dataset, DiffVox,
over the parameter space. The resulting optimisation is equivalent to
maximum-a-posteriori estimation. Evaluations on vocal effects transfer on the
MedleyDB dataset show significant improvements across metrics compared to
baselines, including a blind audio effects estimator, nearest-neighbour
approaches, and uncalibrated ST-ITO. The proposed calibration reduces parameter
mean squared error by up to 33% and matches the reference style better.
Subjective evaluations with 16 participants confirm our method's superiority,
especially in limited data regimes. This work demonstrates how incorporating
prior knowledge at inference time enhances audio effects transfer, paving the
way for more effective and realistic audio processing systems.
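
To illustrate why adding a Gaussian prior turns the optimisation into maximum-a-posteriori estimation, the following is a minimal sketch rather than the authors' implementation. It assumes a toy "embedding" that observes only the sum of two gain parameters, so many configurations match the reference equally well, together with a diagonal Gaussian prior and a squared-error distance. The names `map_objective`, `map_gradient`, `optimise`, `closed_form`, and the `weight` knob are hypothetical and chosen for illustration only.

```python
def map_objective(theta, z_ref, mu, var, weight=1.0):
    """Toy embedding distance plus a weighted Gaussian negative log-prior."""
    # Assumption: the toy "embedding" only sees the summed gain of all stages.
    residual = sum(theta) - z_ref
    # Negative log-density of a diagonal Gaussian, up to an additive constant.
    prior = 0.5 * sum((t - m) ** 2 / v for t, m, v in zip(theta, mu, var))
    return residual ** 2 + weight * prior


def map_gradient(theta, z_ref, mu, var, weight=1.0):
    """Gradient of map_objective with respect to each parameter."""
    residual = sum(theta) - z_ref
    return [2.0 * residual + weight * (t - m) / v
            for t, m, v in zip(theta, mu, var)]


def optimise(z_ref, mu, var, weight=1.0, lr=0.1, steps=2000):
    """Plain gradient descent, started from the prior mean."""
    theta = list(mu)
    for _ in range(steps):
        grad = map_gradient(theta, z_ref, mu, var, weight)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


def closed_form(z_ref, mu, var, weight=1.0):
    """Analytic minimiser of this toy quadratic objective, used as a check."""
    residual = (sum(mu) - z_ref) / (1.0 + 2.0 * sum(var) / weight)
    return [m - 2.0 * residual * v / weight for m, v in zip(mu, var)]


# Hypothetical numbers: prior means and variances for two gain stages (in dB)
# and a reference whose summed gain is 3 dB.
mu, var, z_ref = [-6.0, 3.0], [4.0, 1.0], 3.0
theta_map = optimise(z_ref, mu, var)
print("gradient descent:", theta_map)
print("closed form:     ", closed_form(z_ref, mu, var))
```

Without the prior term, any pair of gains with the right sum would be an equally good solution in this toy setting. With it, the estimate moves away from the prior mean mostly along the parameter with the larger prior variance, which mirrors how a prior can steer the optimisation towards configurations that are more plausible under the preset data.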