Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization

April 15, 2024
Authors: Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria
cs.AI

Abstract

Generative multimodal content is increasingly prevalent in much of the content creation arena, as it has the potential to allow artists and media personnel to create pre-production mockups by quickly bringing their ideas to life. The generation of audio from text prompts is an important aspect of such processes in the music and film industry. Many recent diffusion-based text-to-audio models focus on training increasingly sophisticated diffusion models on large datasets of prompt-audio pairs. These models do not explicitly focus on the presence of concepts or events, or on their temporal ordering, in the output audio with respect to the input prompt. Our hypothesis is that focusing on these aspects of audio generation could improve audio generation performance when data is limited. As such, in this work, using an existing text-to-audio model, Tango, we synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from. The loser outputs, in theory, have some concepts from the prompt missing or in an incorrect order. We fine-tune the publicly available Tango text-to-audio model using the diffusion-DPO (direct preference optimization) loss on our preference dataset and show that it leads to improved audio output over Tango and AudioLDM2, in terms of both automatic and manual evaluation metrics.
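
The abstract names the diffusion-DPO loss without stating it. As a rough illustration only, the sketch below follows the commonly published form of that loss for a single winner/loser pair: the fine-tuned model is rewarded when it improves on a frozen reference model's denoising error by more for the winner than for the loser. The function names, the flat-list representation of noise vectors, and folding the timestep weighting into a single `beta` are assumptions made for this example, not details taken from the Tango 2 work.

```python
import math


def squared_error(noise, predicted):
    """Sum of squared differences between the true and predicted noise."""
    return sum((n - p) ** 2 for n, p in zip(noise, predicted))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(noise_w, pred_w, ref_pred_w,
                       noise_l, pred_l, ref_pred_l, beta=1.0):
    """Diffusion-DPO loss for one (winner, loser) pair (illustrative sketch).

    noise_*    : noise added to the winner (w) / loser (l) latent
    pred_*     : noise predicted by the model being fine-tuned
    ref_pred_* : noise predicted by the frozen reference model
    beta       : assumed to absorb the timestep weighting as well
    """
    # Negative values mean the fine-tuned model denoises better than the reference.
    winner_diff = squared_error(noise_w, pred_w) - squared_error(noise_w, ref_pred_w)
    loser_diff = squared_error(noise_l, pred_l) - squared_error(noise_l, ref_pred_l)
    logit = -beta * (winner_diff - loser_diff)
    # -log(sigmoid(logit)) is the same as softplus(-logit).
    return softplus(-logit)


# When the fine-tuned model matches the reference, the loss equals log(2).
same = diffusion_dpo_loss([0.0, 0.0], [0.5, 0.0], [0.5, 0.0],
                          [0.0, 0.0], [0.5, 0.0], [0.5, 0.0])

# Improving on the winner and degrading on the loser lowers the loss.
better = diffusion_dpo_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0],
                            [0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
```

In practice the loss would be averaged over a batch of preference pairs and sampled diffusion timesteps; the scalar version here is only meant to show the direction in which the winner and loser terms push the model.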
