PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
July 3, 2024
Authors: Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu
cs.AI
Abstract
Recently, audio generation tasks have attracted considerable research
interest. Precise temporal controllability is essential for integrating audio
generation into real applications. In this work, we propose a temporally
controlled audio generation framework, PicoAudio. PicoAudio integrates temporal
information to guide audio generation through tailored model design. It
leverages data crawling, segmentation, filtering, and simulation to obtain
fine-grained, temporally aligned audio-text data. Both subjective and objective
evaluations demonstrate that PicoAudio dramatically surpasses current
state-of-the-art generation models in terms of timestamp and occurrence
frequency controllability. The generated samples are available on the demo
website https://PicoAudio.github.io.
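
To make the idea of simulated, temporally aligned audio-text data more concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each event is described by a label, a duration, and an occurrence count, places the occurrences on a fixed-length timeline, and derives both a timestamp caption and a frame-level on/off matrix. All function names, the caption wording, and the 10 frames-per-second resolution are hypothetical choices made for illustration.

```python
import random


def place_event(label, duration, count, clip_len, rng):
    """Place `count` non-overlapping occurrences of one event on the timeline.

    Hypothetical strategy: split the clip into `count` equal slots and drop
    one occurrence at a random position inside each slot.
    """
    slot = clip_len / count
    if duration > slot:
        raise ValueError("event does not fit the requested number of times")
    segments = []
    for i in range(count):
        onset = i * slot + rng.uniform(0.0, slot - duration)
        segments.append((label, round(onset, 2), round(onset + duration, 2)))
    return segments


def to_caption(segments):
    """Render segments as a timestamp caption (the wording is an assumption)."""
    return " and ".join(f"{label} at {on}-{off}" for label, on, off in segments)


def to_frame_matrix(segments, labels, clip_len, frames_per_sec=10):
    """Build a per-label list of 0/1 frame activations from the segments."""
    n_frames = int(clip_len * frames_per_sec)
    matrix = {label: [0] * n_frames for label in labels}
    for label, on, off in segments:
        start = int(on * frames_per_sec)
        end = min(n_frames, int(round(off * frames_per_sec)))
        for t in range(start, end):
            matrix[label][t] = 1
    return matrix


rng = random.Random(0)  # fixed seed so the simulated metadata is reproducible
clip_len = 10.0
segments = place_event("dog barking", 1.0, 2, clip_len, rng)
segments += place_event("door slam", 0.5, 1, clip_len, rng)
segments.sort(key=lambda seg: seg[1])  # order by onset time

caption = to_caption(segments)
matrix = to_frame_matrix(segments, ["dog barking", "door slam"], clip_len)
print(caption)
```

In this sketch the occurrence count of an event is simply the number of its segments, so timestamp control and occurrence-frequency control can both be expressed through the same segment list.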