PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
July 3, 2024
Authors: Zeyu Xie, Xuenan Xu, Zhizheng Wu, Mengyue Wu
cs.AI
Abstract
Recently, audio generation tasks have attracted considerable research
interest. Precise temporal controllability is essential for integrating audio
generation with real applications. In this work, we propose a temporally
controlled audio generation framework, PicoAudio. PicoAudio integrates temporal
information to guide audio generation through tailored model design. It
leverages data crawling, segmentation, filtering, and simulation of
fine-grained temporally-aligned audio-text data. Both subjective and objective
evaluations demonstrate that PicoAudio dramatically surpasses current
state-of-the-art generation models in terms of timestamp and occurrence
frequency controllability. The generated samples are available on the demo
website https://PicoAudio.github.io.