
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

August 29, 2024
Authors: Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao
cs.AI

Abstract

Language models have been effectively applied to modeling natural signals such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) Extreme compression. By compressing both the number of quantizer layers and the temporal dimension of the discrete codec, one second of audio at a 24 kHz sampling rate requires only a single quantizer producing 40 or 75 tokens. 2) Improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as by introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the speech, audio, and music domains. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also evaluated semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.
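Concretely, the compression claim reduces to simple frame arithmetic: at 24 kHz, a frame hop of 600 samples yields 40 frames per second and a hop of 320 yields 75, with the single quantizer emitting one discrete token per frame. The sketch below is not the authors' code; the 4096-entry codebook and 512-dimensional frame embeddings are illustrative assumptions (the abstract only says "broader VQ space"), used here to make the token rate, the implied bitrate, and a generic nearest-neighbor VQ lookup explicit.

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each frame embedding to the index of its nearest codebook entry."""
    # Squared Euclidean distance via ||a||^2 + ||b||^2 - 2 a.b,
    # which avoids materializing a (T, K, D) broadcast.
    d = ((frames ** 2).sum(1, keepdims=True)
         + (codebook ** 2).sum(1)
         - 2.0 * frames @ codebook.T)
    return d.argmin(axis=1)

sample_rate = 24_000
for hop in (600, 320):  # frame hops that yield 40 and 75 frames per second
    tokens_per_sec = sample_rate / hop
    bitrate = tokens_per_sec * np.log2(4096)  # 12 bits/token, assumed codebook
    print(f"hop={hop}: {tokens_per_sec:.0f} tokens/s ~ {bitrate:.0f} bps")

# Toy lookup: 40 frame embeddings -> 40 token ids from a 4096-entry codebook.
rng = np.random.default_rng(0)
ids = quantize(rng.normal(size=(40, 512)), rng.normal(size=(4096, 512)))
print(ids.shape)  # (40,): one discrete token per frame of audio
```

Under these assumptions, a single quantizer at 40 tokens/s corresponds to roughly 480 bps and at 75 tokens/s to roughly 900 bps, which is the sense in which the paper's single-quantizer design is an extreme-compression regime compared with multi-quantizer residual-VQ codecs.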