JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
December 8, 2025
Authors: Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
cs.AI
Abstract
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
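As a rough sketch of how the Stage 2 mixed-radix packing could work: each FSQ dimension quantizes its latent coordinate to one of a small number of levels, and a frame's code vector can be packed into a single integer by treating the per-dimension codes as the digits of a mixed-radix number. The level configuration `LEVELS` and the helper names below are hypothetical illustrations, not the paper's actual quantizer settings.

```python
import math

# Hypothetical FSQ levels per latent dimension (an assumption for
# illustration; the paper's real configuration may differ).
LEVELS = [8, 8, 8, 5, 5, 5]

def pack_mixed_radix(codes, levels=LEVELS):
    """Pack per-dimension FSQ codes into one integer token.

    codes[i] must lie in [0, levels[i]). Each code is a digit in its
    own base, so the packing is exactly invertible.
    """
    token = 0
    for code, base in zip(codes, levels):
        assert 0 <= code < base, "FSQ code out of range for its level"
        token = token * base + code
    return token

def unpack_mixed_radix(token, levels=LEVELS):
    """Invert pack_mixed_radix, recovering the per-dimension codes."""
    codes = []
    for base in reversed(levels):
        codes.append(token % base)
        token //= base
    return list(reversed(codes))

codes = [3, 7, 0, 4, 1, 2]
token = pack_mixed_radix(codes)
assert unpack_mixed_radix(token) == codes
# Token vocabulary size is the product of the levels: 8*8*8*5*5*5 = 64,000.
print(token, math.prod(LEVELS))
```

Because every FSQ dimension contributes one digit in its own base, the packed vocabulary size is simply the product of the levels, and unpacking is lossless. This is the property that makes the resulting token stream reversible while keeping it compact enough for language-model consumption.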