Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization
January 30, 2026
Authors: Luca Della Libera, Cem Subakan, Mirco Ravanelli
cs.AI
Abstract
Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
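The key idea behind the variable frame rate is explicit duration modeling: each character-aligned token covers a variable number of audio frames, and the decoder expands tokens back into a frame-level sequence according to their durations. A minimal sketch of this length-regulation step is shown below; the function name, shapes, and data are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def expand_by_duration(tokens: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Expand character-level token embeddings into a frame-level sequence
    by repeating each token for its predicted duration (in frames).

    tokens:    (T, D) array, one embedding per character-aligned token
    durations: (T,)   array of integer frame counts per token

    Returns a (sum(durations), D) frame-level sequence.
    """
    return np.repeat(tokens, durations, axis=0)

# Toy example: 3 character tokens covering 2, 1, and 3 frames respectively.
tokens = np.arange(6, dtype=float).reshape(3, 2)
durations = np.array([2, 1, 3])
frames = expand_by_duration(tokens, durations)
assert frames.shape == (6, 2)  # 2 + 1 + 3 = 6 frames total
```

Because durations are explicit inputs at decoding time, they can be edited directly (e.g., scaled to change speaking rate) without re-running any alignment, which is what the abstract means by alignment-free inference with direct control over token durations.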