Enhance audio generation controllability through representation similarity regularization
September 15, 2023
Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra
cs.AI
Abstract
This paper presents an innovative approach to enhancing control over audio
generation by emphasizing the alignment between audio and text representations
during model training. In the context of language model-based audio generation,
the model leverages input from both textual and audio token representations to
predict subsequent audio tokens. However, the current configuration lacks
explicit regularization to ensure the alignment between the chosen text
representation and the language model's predictions. We propose incorporating
audio and text representation regularization, particularly during the
classifier-free guidance (CFG) phase, where the text condition is excluded
from cross-attention during language model training. The proposed
representation regularization aims to minimize discrepancies in audio-text
similarity relative to other samples within the same training batch.
Experimental results on both music and audio generation tasks demonstrate that
our proposed methods lead to improvements in objective metrics for both audio
and music generation, as well as an improvement in human perception of the
generated audio.
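
The abstract does not give the regularizer in closed form; one plausible reading of "minimize discrepancies in audio-text similarity relative to other samples within the same training batch" is a contrastive, in-batch objective over pooled audio and text embeddings. The PyTorch sketch below is written under that assumption; the function name, the mean pooling, and the temperature are illustrative choices, not details from the paper.

```python
import torch
import torch.nn.functional as F

def representation_regularization(audio_hidden, text_hidden, temperature=0.1):
    """Hypothetical in-batch similarity regularizer (an assumed reading of
    the paper's objective, not its published loss).

    audio_hidden: (B, T_a, D) language-model hidden states over audio tokens.
    text_hidden:  (B, T_t, D) text-encoder hidden states for the same batch.
    """
    # Pool each sequence to a single embedding and L2-normalize.
    audio_emb = F.normalize(audio_hidden.mean(dim=1), dim=-1)  # (B, D)
    text_emb = F.normalize(text_hidden.mean(dim=1), dim=-1)    # (B, D)

    # Cosine similarity between every audio/text pair in the batch.
    logits = audio_emb @ text_emb.t() / temperature            # (B, B)

    # Push each sample's similarity to its own caption above its similarity
    # to the other captions in the batch, symmetrically in both directions.
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Using in-batch negatives keeps such a regularizer cheap: it reuses representations already produced by the forward pass and adds only a B-by-B similarity matrix per step.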
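
For context, here is a similarly hedged sketch of how such a term could be wired into CFG-style training, where the text condition is dropped from cross-attention with some probability and, as the abstract describes, the regularization is applied on those unconditioned steps. The `lm` interface, `CFG_DROPOUT_P`, and the next-token loss wiring are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

CFG_DROPOUT_P = 0.1  # illustrative condition-dropout rate, not from the paper

def training_step(lm, audio_tokens, text_hidden):
    """audio_tokens: (B, T) discrete codec tokens; text_hidden: (B, T_t, D).
    `lm` is a hypothetical language model returning (logits, audio_hidden)."""
    # Occasionally exclude the text condition from cross-attention; these
    # unconditioned steps are what make CFG usable at inference time.
    drop_text = bool(torch.rand(()) < CFG_DROPOUT_P)
    cond = None if drop_text else text_hidden
    logits, audio_hidden = lm(audio_tokens, text_condition=cond)

    # Standard next-audio-token prediction loss.
    loss = F.cross_entropy(logits[:, :-1].transpose(1, 2), audio_tokens[:, 1:])

    if drop_text:
        # Apply the representation similarity regularization (sketched above)
        # on exactly the steps where the text condition was dropped.
        loss = loss + representation_regularization(audio_hidden, text_hidden)
    return loss
```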