Discrete Audio Tokens: More Than a Survey!
June 12, 2025
Authors: Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
cs.AI
Abstract
Discrete audio tokens are compact representations that aim to preserve
perceptual quality, phonetic content, and speaker characteristics while
enabling efficient storage and inference, as well as competitive performance
across diverse downstream tasks. They provide a practical alternative to
continuous features, enabling the integration of speech and audio into modern
large language models (LLMs). As interest in token-based audio processing
grows, various tokenization methods have emerged, and several surveys have
reviewed the latest progress in the field. However, existing studies often
focus on specific domains or tasks and lack a unified comparison across various
benchmarks. This paper presents a systematic review and benchmark of discrete
audio tokenizers, covering three domains: speech, music, and general audio. We
propose a taxonomy of tokenization approaches based on encoder-decoder architecture,
quantization techniques, training paradigm, streamability, and application
domains. We evaluate tokenizers on multiple benchmarks for reconstruction,
downstream performance, and acoustic language modeling, and analyze trade-offs
through controlled ablation studies. Our findings highlight key limitations,
practical considerations, and open challenges, providing insight and guidance
for future research in this rapidly evolving area. For more information,
including our main results and tokenizer database, please refer to our website:
https://poonehmousavi.github.io/dates-website/.