PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Abstract

Audio generation has recently attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose PicoAudio, a temporally controlled audio generation framework. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained, temporally aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://PicoAudio.github.io.
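To make the two kinds of control named in the abstract concrete (timestamp control and occurrence frequency control), the following is a minimal sketch of how one list of timed audio events could be rendered as either kind of text condition. The caption formats, the function names, and the example events are illustrative assumptions and are not taken from the PicoAudio paper or its code.

```python
from collections import defaultdict


def timestamp_caption(events):
    """Render events as a timestamp-style caption.

    `events` is a list of (label, onset_seconds, offset_seconds) tuples.
    The output format is a hypothetical one chosen for illustration.
    """
    spans_by_label = defaultdict(list)
    for label, onset, offset in events:
        if offset <= onset:
            raise ValueError(f"offset must be after onset for {label!r}")
        spans_by_label[label].append((onset, offset))

    parts = []
    # Labels keep first-seen order; spans are sorted by onset time.
    for label, spans in spans_by_label.items():
        spans_text = ", ".join(f"{on:g}-{off:g}" for on, off in sorted(spans))
        parts.append(f"{label} at {spans_text}")
    return " and ".join(parts)


def frequency_caption(events):
    """Render events as an occurrence-frequency caption (hypothetical format)."""
    counts = defaultdict(int)
    for label, _onset, _offset in events:
        counts[label] += 1
    return " and ".join(
        f"{label} {n} time{'s' if n != 1 else ''}" for label, n in counts.items()
    )


# Made-up example events, not data from the paper.
events = [
    ("dog barking", 0.5, 2.5),
    ("door slamming", 3.0, 3.5),
    ("dog barking", 4.0, 5.0),
]
print(timestamp_caption(events))
print(frequency_caption(events))
```

The first caption pins each event to explicit onset and offset times, while the second only states how often each event occurs, leaving placement to the generator. A temporally aligned audio-text pair, in this sketch, is simply such a caption together with the audio clip the event list describes.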
