WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
August 29, 2024
Authors: Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao
cs.AI
Abstract
Language models have been effectively applied to modeling natural signals,
such as images, video, speech, and audio. A crucial component of these models
is the codec tokenizer, which compresses high-dimensional natural signals into
lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer,
which offers several advantages over previous SOTA acoustic codec models in the
audio domain: 1) Extreme compression. By compressing the quantizer layers
and the temporal dimension of the discrete codec, one second of audio at a
24 kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) Improved
subjective quality. Despite the reduced number of tokens, WavTokenizer achieves
state-of-the-art reconstruction quality with outstanding UTMOS scores and
inherently contains richer semantic information. Specifically, we achieve these
results by designing a broader VQ space, extended contextual windows, and
improved attention networks, as well as introducing a powerful multi-scale
discriminator and an inverse Fourier transform structure. We conducted
extensive reconstruction experiments in the domains of speech, audio, and
music. WavTokenizer exhibited strong performance across various objective and
subjective metrics compared to state-of-the-art models. We also tested semantic
information, VQ utilization, and adaptability to generative models.
Comprehensive ablation studies confirm the necessity of each module in
WavTokenizer. The related code, demos, and pre-trained models are available at
https://github.com/jishengpeng/WavTokenizer.
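
As a quick sanity check on the compression claim, the sketch below works out the temporal downsampling factor and approximate bitrate implied by the abstract's numbers. The 24 kHz sampling rate and the 40/75 tokens-per-second rates come directly from the abstract; the codebook size of 4096 is an illustrative assumption (the abstract only mentions a "broader VQ space"), so the bitrates are estimates, not reported figures.

```python
import math

def tokenizer_stats(sample_rate_hz: int, tokens_per_second: int, codebook_size: int):
    """Downsampling factor and bitrate for a single-quantizer discrete codec."""
    downsampling = sample_rate_hz / tokens_per_second          # audio samples represented per token
    bits_per_token = math.log2(codebook_size)                  # bits needed to index one codebook entry
    bitrate_kbps = tokens_per_second * bits_per_token / 1000   # bitrate of the discrete token stream
    return downsampling, bitrate_kbps

# 24 kHz audio, a single quantizer, 40 or 75 tokens per second (from the abstract).
# Codebook size 4096 is an assumption made here for illustration only.
for tps in (40, 75):
    ds, kbps = tokenizer_stats(24_000, tps, 4096)
    print(f"{tps} tokens/s -> {ds:.0f}x temporal downsampling, ~{kbps:.2f} kbps")
# 40 tokens/s -> 600x temporal downsampling, ~0.48 kbps
# 75 tokens/s -> 320x temporal downsampling, ~0.90 kbps
```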