
Enhance audio generation controllability through representation similarity regularization

September 15, 2023
Authors: Yangyang Shi, Gael Le Lan, Varun Nagaraja, Zhaoheng Ni, Xinhao Mei, Ernie Chang, Forrest Iandola, Yang Liu, Vikas Chandra
cs.AI

Abstract

This paper presents an approach to enhancing control over audio generation by enforcing alignment between audio and text representations during model training. In language model-based audio generation, the model predicts each subsequent audio token from both textual and audio token representations. However, the standard setup has no explicit regularization ensuring that the chosen text representation stays aligned with the language model's predictions. The proposal incorporates an audio-text representation regularization, applied specifically during the classifier-free guidance (CFG) phase of training, in which the text condition is dropped from cross-attention. The regularization encourages each sample's audio representation to be more similar to its paired text representation than to those of other samples in the same training batch. Experimental results on both music and audio generation tasks show that the proposed method improves objective metrics for audio and music generation, and also improves human-perceived quality of the generated audio.
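The abstract does not spell out the exact form of the regularizer, but the description (pull each sample's audio and text representations together relative to the other samples in the batch) matches a CLIP-style batch-contrastive loss. Below is a minimal PyTorch sketch under that assumption; the function name, the use of pooled per-sample embeddings, the temperature value, and the loss weighting are all illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def representation_similarity_loss(audio_emb: torch.Tensor,
                                   text_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Batch-contrastive audio-text regularizer (a sketch, not the paper's exact loss).

    audio_emb, text_emb: (batch, dim) pooled representations of the audio
    tokens and the text condition for the same samples, in matching order.
    """
    # L2-normalize so the dot product below is cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are matched pairs.
    logits = audio_emb @ text_emb.t() / temperature

    # Push each audio embedding toward its own text embedding and away from
    # the other texts in the batch, and symmetrically for text -> audio.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)
```

In training, such a term would be added to the language model's next-token loss with some weight (e.g., `loss = lm_loss + lambda_reg * representation_similarity_loss(audio_emb, text_emb)`, where `lambda_reg` is a hypothetical hyperparameter), applied on the CFG steps where the text condition is excluded from cross-attention, per the abstract.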