

UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

February 4, 2026
作者: Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
cs.AI

Abstract

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
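To make the factorized-tokenization idea concrete, here is a minimal, purely illustrative sketch of the interface such a codec exposes: one waveform goes in, and two discrete token streams come out, with the reasoning stream at a much lower rate than the reconstruction stream. The class name, hop sizes, vocabulary size, and energy-based quantizer below are all invented stand-ins (the real ReasoningCodec uses learned neural quantization); only the two-stream factorization reflects the paper's design.

```python
import numpy as np

class ReasoningCodecSketch:
    """Toy stand-in for a factorized audio codec.

    Maps a waveform to two discrete token streams:
    - reasoning tokens: coarse, low-rate stream (understanding/planning)
    - reconstruction tokens: finer, higher-rate stream (acoustic detail)
    All rates and vocabulary sizes here are made-up placeholders.
    """

    def __init__(self, vocab_size=1024, reasoning_hop=1600, recon_hop=320):
        self.vocab_size = vocab_size
        self.reasoning_hop = reasoning_hop  # samples per reasoning token
        self.recon_hop = recon_hop          # samples per reconstruction token

    def encode(self, wav: np.ndarray):
        # Frame the signal at two rates and quantize each frame's mean
        # energy into a discrete id -- a toy proxy for learned VQ.
        def tokenize(hop):
            n = len(wav) // hop
            frames = wav[: n * hop].reshape(n, hop)
            energy = np.abs(frames).mean(axis=1)
            return (energy * self.vocab_size).astype(int) % self.vocab_size
        return tokenize(self.reasoning_hop), tokenize(self.recon_hop)


codec = ReasoningCodecSketch()
wav = np.random.default_rng(0).uniform(-1, 1, 16000)  # 1 s at 16 kHz
reasoning, recon = codec.encode(wav)
print(len(reasoning), len(recon))  # coarse stream is 5x shorter here
```

A downstream language model would interleave the short reasoning stream with text tokens for understanding tasks, then condition the longer reconstruction stream on it for hierarchical generation, which is the division of labor the abstract describes.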