
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

September 26, 2025
作者: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
cs.AI

Abstract

Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state-of-the-art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
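The bit-wise voting mechanism described above can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the integer token-ID representation, and the per-frame call shape are all assumptions made for illustration. The idea shown is just the core consensus step: each branch emits a token ID for a frame, and the merged token takes the majority value of each bit across branches, so a single noisy branch cannot flip the output.

```python
# Hypothetical sketch of bit-wise majority voting across tokenizer branches.
# All names and shapes are illustrative assumptions, not the paper's API.

def bitwise_majority_vote(branch_tokens, num_bits):
    """Merge per-branch token IDs for one frame into a single stable token.

    branch_tokens: list of ints, one token ID per branch (same audio frame).
    num_bits: number of bits in each token ID.
    """
    merged = 0
    for b in range(num_bits):
        # Count how many branches set bit b in their token ID.
        ones = sum((tok >> b) & 1 for tok in branch_tokens)
        if ones * 2 > len(branch_tokens):  # strict majority sets this bit
            merged |= 1 << b
    return merged

# Example: three branches; one noisy branch flips the top bit,
# but the majority vote recovers the clean token.
tokens = [0b1011, 0b1011, 0b0011]
print(bitwise_majority_vote(tokens, num_bits=4))  # prints 11 (0b1011)
```

With an odd number of branches, every bit has a strict majority, so the merged token is always well defined; this is one plausible reason a multi-branch design can keep the output sequence stable under meaning-irrelevant perturbations.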