
UniAudio 2.0: A Unified Audio Language Model with Text-Aligned Factorized Audio Tokenization

February 4, 2026
Authors: Dongchao Yang, Yuanyuan Wang, Dading Chong, Songxiang Liu, Xixin Wu, Helen Meng
cs.AI

Abstract

We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
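
To make the factorization concrete, below is a minimal Python sketch of the interface such a design implies: one encoder emits two discrete token streams, and the streams are flattened (reasoning tokens first, reconstruction tokens second, mirroring the hierarchical-generation order) into a single sequence an autoregressive model can consume. Every name here (ReasoningCodecStub, FactorizedTokens, to_ar_sequence), the vocabulary sizes, and the frame-hashing stand-in for a learned quantizer are illustrative assumptions, not the paper's actual API or implementation.

```python
# A minimal sketch of the factorized-token interface implied by the abstract.
# All names, vocabulary sizes, and the frame-hashing "quantizer" are
# illustrative assumptions -- the real ReasoningCodec uses learned encoders
# and quantizers, which are not specified here.
from dataclasses import dataclass
from typing import List

@dataclass
class FactorizedTokens:
    reasoning: List[int]       # text-aligned, high-level analysis/planning stream
    reconstruction: List[int]  # semantic-rich acoustic stream for waveform recovery

class ReasoningCodecStub:
    """Stand-in for a two-stream codec: one input, two discrete token streams."""

    def __init__(self, reasoning_vocab: int = 4096, recon_vocab: int = 8192):
        self.reasoning_vocab = reasoning_vocab
        self.recon_vocab = recon_vocab

    def encode(self, waveform: List[float], frame: int = 320) -> FactorizedTokens:
        # Placeholder quantization: hash each frame into the two codebooks.
        # Coarse rounding for the reasoning stream, finer for reconstruction,
        # to mimic "high-level plan" vs. "acoustic detail".
        frames = [waveform[i:i + frame] for i in range(0, len(waveform), frame)]
        reasoning = [hash(tuple(round(x, 1) for x in f)) % self.reasoning_vocab
                     for f in frames]
        reconstruction = [hash(tuple(round(x, 4) for x in f)) % self.recon_vocab
                          for f in frames]
        return FactorizedTokens(reasoning, reconstruction)

def to_ar_sequence(text_ids: List[int], toks: FactorizedTokens,
                   text_vocab: int, reasoning_vocab: int = 4096) -> List[int]:
    """Flatten text + audio tokens into one stream with disjoint ID ranges,
    reasoning before reconstruction (coarse plan, then detail), so a single
    autoregressive model can be trained on the result."""
    r_off = text_vocab                    # reasoning IDs start after text IDs
    c_off = text_vocab + reasoning_vocab  # reconstruction IDs start after those
    return (text_ids
            + [r_off + t for t in toks.reasoning]
            + [c_off + t for t in toks.reconstruction])

# Usage: tokenize 0.1 s of (silent) 16 kHz audio and build one training sequence.
codec = ReasoningCodecStub()
toks = codec.encode([0.0] * 1600)
seq = to_ar_sequence(text_ids=[101, 7, 42], toks=toks, text_vocab=32000)
print(len(toks.reasoning), len(toks.reconstruction), len(seq))  # 5 5 13
```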