
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

February 11, 2026
Authors: Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu
cs.AI

Abstract

Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
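The abstract's central object is a discrete tokenizer: a quantizer turns continuous frame embeddings into integer token ids that a language model can consume, and the decoder side maps ids back to embeddings. As a toy, hedged illustration of that quantization step only (plain nearest-neighbor vector quantization with a random codebook, not the authors' CAT encoder/quantizer/decoder or their training procedure; all array names and sizes here are invented), the round trip from embeddings to tokens and back can be sketched as:

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame embedding to its nearest codebook entry (L2 distance).

    frames:   (T, D) array of continuous frame embeddings
    codebook: (K, D) array of code vectors
    Returns (tokens, recon): integer ids of shape (T,) and the (T, D)
    embeddings recovered by codebook lookup on the decoder side.
    """
    # Pairwise squared distances between every frame and every code vector.
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)           # discrete token ids, one per frame
    return tokens, codebook[tokens]      # lookup reconstructs embeddings

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))       # K=8 codes, D=4 dims (toy sizes)
frames = rng.normal(size=(5, 4))         # T=5 frames of "audio" features
tokens, recon = quantize(frames, codebook)
```

In a real codec the codebook is learned jointly with the encoder and decoder (end-to-end, as the paper argues), and the token rate times log2(K) sets the bitrate; this sketch only shows why the interface to a language model becomes a plain integer sequence.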