Enhance audio generation controllability through representation similarity regularization
September 15, 2023
Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra
cs.AI
Abstract
This paper presents an innovative approach to enhancing control over audio
generation by emphasizing the alignment between audio and text representations
during model training. In the context of language model-based audio generation,
the model leverages input from both textual and audio token representations to
predict subsequent audio tokens. However, the current configuration lacks
explicit regularization to ensure the alignment between the chosen text
representation and the language model's predictions. We propose incorporating
audio and text representation regularization, particularly during the
classifier-free guidance (CFG) phase, where the text condition is excluded
from cross-attention during language model training. The proposed
representation regularization aims to minimize discrepancies in audio-text
similarity relative to other samples within the same training batch.
Experimental results on both music and audio generation tasks demonstrate that
our proposed methods lead to improvements in objective metrics for both audio
and music generation, as well as an improvement in human perception of the
generated audio.
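
The abstract does not give the regularizer in closed form; one plausible reading of "minimize discrepancies in audio-text similarity relative to other samples within the same training batch" is a contrastive, in-batch objective over pooled audio and text embeddings. The PyTorch sketch below is written under that assumption; the function name, the mean pooling, and the temperature are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def representation_regularization(audio_hidden, text_hidden, temperature=0.1):
    """Hypothetical in-batch similarity regularizer (an assumed reading of
    the paper's objective, not its published loss).

    audio_hidden: (B, T_a, D) language-model hidden states over audio tokens.
    text_hidden:  (B, T_t, D) text-encoder hidden states for the same batch.
    """
    # Pool each sequence to a single embedding and L2-normalize.
    audio_emb = F.normalize(audio_hidden.mean(dim=1), dim=-1)  # (B, D)
    text_emb = F.normalize(text_hidden.mean(dim=1), dim=-1)    # (B, D)

    # Cosine similarity between every audio/text pair in the batch.
    logits = audio_emb @ text_emb.t() / temperature            # (B, B)

    # Push each sample's similarity to its own caption above its similarity
    # to the other captions in the batch, symmetrically in both directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Using in-batch negatives keeps such a regularizer cheap: it reuses representations already produced by the forward pass and adds only a B-by-B similarity matrix per step.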
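
For context, here is a similarly hedged sketch of how such a term could be wired into CFG-style training, where the text condition is dropped from cross-attention with some probability and, as the abstract describes, the regularization is applied on those unconditioned steps. The `lm` interface, `CFG_DROPOUT_P`, and the next-token loss wiring are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

CFG_DROPOUT_P = 0.1  # illustrative condition-dropout rate, not from the paper

def training_step(lm, audio_tokens, text_hidden):
    """audio_tokens: (B, T) discrete codec tokens; text_hidden: (B, T_t, D).
    `lm` is a hypothetical language model returning (logits, audio_hidden)."""
    # Occasionally exclude the text condition from cross-attention; these
    # unconditioned steps are what make CFG usable at inference time.
    drop_text = bool(torch.rand(()) < CFG_DROPOUT_P)
    cond = None if drop_text else text_hidden
    logits, audio_hidden = lm(audio_tokens, text_condition=cond)

    # Standard next-audio-token prediction loss.
    loss = F.cross_entropy(logits[:, :-1].transpose(1, 2), audio_tokens[:, 1:])

    if drop_text:
        # Apply the representation similarity regularization (sketched above)
        # on exactly the steps where the text condition was dropped.
        loss = loss + representation_regularization(audio_hidden, text_hidden)
    return loss
```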