Discrete Audio Tokens: More Than a Survey!

Abstract

Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder architecture, quantization technique, training paradigm, streamability, and application domain. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

