UniAudio: An Audio Foundation Model Toward Universal Audio Generation

Abstract

Language models (LMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents UniAudio, a system that, unlike prior task-specific approaches, leverages LM techniques to generate multiple types of audio (including speech, sounds, music, and singing) from given input conditions. UniAudio 1) first tokenizes all types of target audio along with the other condition modalities, 2) concatenates each source-target pair into a single sequence, and 3) performs next-token prediction using an LM. In addition, a multi-scale Transformer model is proposed to handle the overly long sequences produced by the residual-vector-quantization-based neural codec used in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters across all generative tasks, with the aim of acquiring sufficient prior knowledge not only of the intrinsic properties of audio but also of the inter-relationship between audio and other modalities. The trained UniAudio model therefore has the potential to become a foundation model for universal audio generation: it shows strong capability on all trained tasks and can seamlessly support new audio generation tasks after simple fine-tuning. Experiments demonstrate that UniAudio achieves state-of-the-art or at least competitive results on most of the 11 tasks. Demo and code are released at https://github.com/yangdongchao/UniAudio.
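The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the released UniAudio code: the special-token ids (`SEP`, `EOS`), the vocabulary layout, and the function names `flatten_rvq`, `build_sequence`, and `next_token_pairs` are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how residual vector quantization (RVQ) codes might be flattened into one token stream, joined with source tokens, and turned into next-token prediction pairs.

```python
from typing import List, Sequence, Tuple

# Hypothetical special-token ids; the abstract does not specify any.
SEP, EOS = 0, 1
NUM_SPECIAL = 2


def flatten_rvq(codes: Sequence[Sequence[int]], codebook_size: int) -> List[int]:
    """Flatten per-frame RVQ indices into one token stream.

    codes[t][k] is the index picked by quantizer k at frame t. Each quantizer
    gets its own id range so tokens from different codebooks never collide.
    """
    return [
        NUM_SPECIAL + k * codebook_size + index
        for frame in codes
        for k, index in enumerate(frame)
    ]


def build_sequence(source: Sequence[int], target: Sequence[int]) -> List[int]:
    """Concatenate a source-target pair into a single sequence.

    Assumes source ids were already mapped into the shared vocabulary.
    """
    return [*source, SEP, *target, EOS]


def next_token_pairs(sequence: Sequence[int]) -> List[Tuple[int, int]]:
    """Pair every token with the token that follows it (the training target)."""
    return list(zip(sequence[:-1], sequence[1:]))


if __name__ == "__main__":
    audio_codes = [[1, 2, 3], [4, 5, 6]]  # 2 frames, 3 quantizers each
    target = flatten_rvq(audio_codes, codebook_size=10)
    sequence = build_sequence(source=[100, 101], target=target)
    print(len(target), len(sequence), next_token_pairs(sequence)[:2])
```

In this flattened layout the audio token count equals the number of frames times the number of quantizers, which is the sequence-length growth the abstract says the multi-scale Transformer is proposed to handle.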
