PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Abstract

Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose PicoAudio, a temporally controlled audio generation framework. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained, temporally aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://PicoAudio.github.io.
