High-Fidelity Audio Compression with Improved RVQGAN

Abstract

Language models have been successfully used to model natural signals such as images, speech, and music. A key component of these models is a high-quality neural compression model that can compress high-dimensional natural signals into lower-dimensional discrete tokens. To that end, we introduce a high-fidelity universal neural audio compression algorithm that achieves ~90x compression of 44.1 kHz audio into tokens at just 8 kbps bandwidth. We achieve this by combining advances in high-fidelity audio generation with better vector quantization techniques from the image domain, along with improved adversarial and reconstruction losses. We compress all domains (speech, environmental sound, music, etc.) with a single universal model, making it widely applicable to generative modeling of all audio. We compare against competing audio compression algorithms and find that our method outperforms them significantly. We provide thorough ablations for every design choice, as well as open-source code and trained model weights. We hope our work can lay the foundation for the next generation of high-fidelity audio modeling.
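The "~90x" figure in the abstract can be sanity-checked with simple bitrate arithmetic. The sketch below assumes the uncompressed reference is 16-bit mono PCM; the abstract states only the 44.1 kHz sample rate and the 8 kbps target, so the bit depth and channel count are assumptions, and the function names are illustrative.

```python
def pcm_bitrate_kbps(sample_rate_hz, bits_per_sample=16, channels=1):
    """Bitrate of uncompressed PCM audio in kilobits per second.

    bits_per_sample=16 and channels=1 are assumptions, not values
    stated in the abstract.
    """
    return sample_rate_hz * bits_per_sample * channels / 1000.0


def compression_ratio(sample_rate_hz, target_kbps, bits_per_sample=16, channels=1):
    """Ratio of the raw PCM bitrate to the compressed token bitrate."""
    raw_kbps = pcm_bitrate_kbps(sample_rate_hz, bits_per_sample, channels)
    return raw_kbps / target_kbps


# Values taken from the abstract: 44.1 kHz audio, 8 kbps token stream.
raw_kbps = pcm_bitrate_kbps(44_100)
ratio = compression_ratio(44_100, 8.0)
print(f"raw PCM: {raw_kbps:.1f} kbps, compressed: 8.0 kbps, ratio: {ratio:.1f}x")
```

Under these assumptions the raw stream is 705.6 kbps, giving a ratio of about 88x, which is consistent with the abstract's rounded "~90x".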

