

MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

February 11, 2026
Authors: Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu
cs.AI

Abstract

Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
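The abstract describes a tokenizer pipeline in which an encoder maps audio to continuous latent frames, a quantizer maps those frames to discrete tokens, and a decoder reconstructs the waveform. As an illustration of the quantization step only, here is a toy nearest-neighbour vector quantizer in pure Python. This is a hypothetical sketch for intuition, not the paper's implementation: in CAT / MOSS-Audio-Tokenizer the codebook is learned jointly, end-to-end, with the Transformer encoder and decoder.

```python
def quantize(latents, codebook):
    """Map each continuous latent frame to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two latent frames.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda i: sq_dist(frame, codebook[i]))
            for frame in latents]

def dequantize(tokens, codebook):
    """Look up the codebook vector for each discrete token (decoder input)."""
    return [codebook[t] for t in tokens]

# Toy 4-entry codebook of 2-dimensional latent frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Three "encoded" audio frames; each snaps to its nearest codebook entry.
latents = [(0.1, -0.2), (0.9, 0.1), (0.2, 0.8)]
tokens = quantize(latents, codebook)      # discrete sequence an LM can model
recon = dequantize(tokens, codebook)      # what the decoder would receive
```

The discrete `tokens` sequence is the interface the abstract refers to: an autoregressive language model (e.g. the TTS model built on MOSS-Audio-Tokenizer) predicts such token sequences, and the decoder turns them back into audio.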
PDF · February 14, 2026