PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-Audio Generation

Abstract

Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose PicoAudio, a temporally controlled audio generation framework. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation of fine-grained, temporally aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://PicoAudio.github.io.