JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
December 8, 2025
Authors: Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
cs.AI
Abstract
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
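The abstract names Gaussian mixture-based density-adaptive gating inside the JEPA encoder but does not spell out its form. The sketch below is one plausible reading, not the published DAAM: each scalar activation is gated by its density under a small learned Gaussian mixture, normalized over time so that frames in high-density regions are emphasized. The class name `GaussianMixtureGate` and all hyperparameters are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class GaussianMixtureGate(nn.Module):
    """Minimal sketch of density-adaptive gating (assumed form, not the authors' DAAM).

    Each scalar activation is weighted by its probability density under a
    learned K-component Gaussian mixture, so time steps whose activations
    fall in high-density regions of the learned distribution are emphasized.
    """

    def __init__(self, num_components: int = 4):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_components))
        self.log_stds = nn.Parameter(torch.zeros(num_components))
        self.logits = nn.Parameter(torch.zeros(num_components))  # mixture weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        weights = torch.softmax(self.logits, dim=0)   # (K,)
        stds = self.log_stds.exp()                    # (K,)
        # Evaluate each Gaussian component's density at every activation.
        z = (x.unsqueeze(-1) - self.means) / stds     # (B, T, C, K)
        log_pdf = -0.5 * z.pow(2) - stds.log() - 0.5 * math.log(2 * math.pi)
        density = (weights * log_pdf.exp()).sum(dim=-1)  # (B, T, C)
        # Normalize by the temporal mean so the gate averages to ~1 and acts
        # as a soft temporal feature selector rather than a global rescaling.
        gate = density / (density.mean(dim=1, keepdim=True) + 1e-8)
        return x * gate

# Usage: gate a (batch=2, time=50, channels=64) encoder feature map.
feats = torch.randn(2, 50, 64)
gated = GaussianMixtureGate(num_components=4)(feats)
assert gated.shape == feats.shape
```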
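FSQ and mixed-radix packing are standard enough to sketch concretely. Assuming hypothetical per-dimension level counts (the abstract does not report the actual FSQ configuration), the snippet below quantizes a bounded latent dimension-wise and packs the resulting digit vector into a single integer via mixed-radix arithmetic; the exact invertibility of this packing is what makes the tokens reversible.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize each latent dimension of z to a fixed number of levels.

    z: array of shape (..., D) with values assumed bounded in [-1, 1]
       (FSQ typically bounds the latent, e.g. with a tanh).
    levels: list of D ints, the number of quantization levels per dimension.
    Returns integer codes of shape (..., D), with code i in [0, levels[i]).
    """
    codes = []
    for i, L in enumerate(levels):
        # Map [-1, 1] -> {0, ..., L-1} by rounding to the nearest level.
        c = np.round((z[..., i] + 1.0) / 2.0 * (L - 1)).astype(np.int64)
        codes.append(np.clip(c, 0, L - 1))
    return np.stack(codes, axis=-1)

def pack_mixed_radix(codes, levels):
    """Pack per-dimension FSQ codes into one integer token.

    Treats the code vector as digits of a mixed-radix number:
    token = c_0 + c_1 * L_0 + c_2 * L_0 * L_1 + ...
    """
    token = np.zeros(codes.shape[:-1], dtype=np.int64)
    base = 1
    for i, L in enumerate(levels):
        token += codes[..., i] * base
        base *= L
    return token

def unpack_mixed_radix(token, levels):
    """Invert pack_mixed_radix, recovering the per-dimension codes."""
    codes = []
    for L in levels:
        codes.append(token % L)
        token = token // L
    return np.stack(codes, axis=-1)

# Example with illustrative levels [8, 8, 8, 5, 5]: the packed vocabulary
# has 8 * 8 * 8 * 5 * 5 = 12800 possible tokens per code group.
levels = [8, 8, 8, 5, 5]
z = np.tanh(np.random.randn(4, len(levels)))  # 4 frames of bounded latents
codes = fsq_quantize(z, levels)
tokens = pack_mixed_radix(codes, levels)
assert np.array_equal(unpack_mixed_radix(tokens, levels), codes)
```

Note that the reported rates imply 47.5 / 2.5 = 19 tokens per frame, which suggests the packing groups the FSQ dimensions into several tokens per frame rather than a single one; how the dimensions are grouped is not stated in the abstract.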