
Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization

January 30, 2026
作者: Luca Della Libera, Cem Subakan, Mirco Ravanelli
cs.AI

Abstract

Neural audio codecs are at the core of modern conversational speech technologies, converting continuous speech into sequences of discrete tokens that can be processed by LLMs. However, existing codecs typically operate at fixed frame rates, allocating tokens uniformly in time and producing unnecessarily long sequences. In this work, we introduce DyCAST, a Dynamic Character-Aligned Speech Tokenizer that enables variable-frame-rate tokenization through soft character-level alignment and explicit duration modeling. DyCAST learns to associate tokens with character-level linguistic units during training and supports alignment-free inference with direct control over token durations at decoding time. To improve speech resynthesis quality at low frame rates, we further introduce a retrieval-augmented decoding mechanism that enhances reconstruction fidelity without increasing bitrate. Experiments show that DyCAST achieves competitive speech resynthesis quality and downstream performance while using significantly fewer tokens than fixed-frame-rate codecs. Code and checkpoints will be released publicly at https://github.com/lucadellalib/dycast.
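The core idea of variable-frame-rate tokenization with explicit durations can be illustrated with a toy sketch. This is not the DyCAST model itself (which learns soft character-level alignments neurally); here a simple run-length merge stands in for the learned alignment, purely to show how duration-annotated tokens shorten a fixed-rate token grid while remaining decodable back to it. All names (`DurToken`, `expand_to_frames`, `compress_fixed_rate`) are hypothetical.

```python
# Toy sketch: duration-annotated tokens vs. a fixed-frame-rate token grid.
# Run-length merging stands in for DyCAST's learned character-level alignment.
from dataclasses import dataclass


@dataclass
class DurToken:
    token_id: int   # codebook index
    n_frames: int   # explicit duration, in fixed-rate frames


def expand_to_frames(seq):
    """Decode duration-annotated tokens back to the fixed frame grid."""
    frames = []
    for t in seq:
        frames.extend([t.token_id] * t.n_frames)
    return frames


def compress_fixed_rate(frames):
    """Merge consecutive identical fixed-rate tokens into duration tokens."""
    seq = []
    for f in frames:
        if seq and seq[-1].token_id == f:
            seq[-1].n_frames += 1
        else:
            seq.append(DurToken(f, 1))
    return seq


fixed = [7, 7, 7, 2, 2, 9, 9, 9, 9]    # 9 tokens at a fixed frame rate
dyn = compress_fixed_rate(fixed)        # 3 duration-annotated tokens
assert expand_to_frames(dyn) == fixed   # lossless round trip
print(f"{len(fixed)} fixed-rate tokens -> {len(dyn)} dynamic tokens")
```

Because each token carries its duration explicitly, the decode side can also manipulate durations directly (e.g. stretching a token's `n_frames`), which mirrors the abstract's point about direct control over token durations at decoding time.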