StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
September 26, 2025
Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
cs.AI
Abstract
Prevalent semantic speech tokenizers, designed to capture linguistic content,
are surprisingly fragile. We find they are not robust to meaning-irrelevant
acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech
is perfectly intelligible, their output token sequences can change drastically,
increasing the learning burden for downstream LLMs. This instability stems from
two flaws: a brittle single-path quantization architecture and a distant
training signal indifferent to intermediate token stability. To address this,
we introduce StableToken, a tokenizer that achieves stability through a
consensus-driven mechanism. Its multi-branch architecture processes audio in
parallel, and these representations are merged via a powerful bit-wise voting
mechanism to form a single, stable token sequence. StableToken sets a new
state-of-the-art in token stability, drastically reducing Unit Edit Distance
(UED) under diverse noise conditions. This foundational stability translates
directly to downstream benefits, significantly improving the robustness of
SpeechLLMs on a variety of tasks.
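The abstract describes merging parallel branch outputs via bit-wise voting to form one stable token sequence. A minimal sketch of that consensus idea, assuming each branch emits a k-bit binary code per frame (the branch outputs below are random stand-ins, not the actual model; the paper's real architecture and voting details may differ):

```python
import numpy as np

def bitwise_vote(branch_codes: np.ndarray) -> np.ndarray:
    """branch_codes: (num_branches, num_frames, num_bits) array of 0/1 bits.

    Returns a (num_frames, num_bits) array where each bit is the
    majority value across branches (ties impossible for odd counts).
    """
    num_branches = branch_codes.shape[0]
    return (branch_codes.sum(axis=0) * 2 > num_branches).astype(np.uint8)

# Toy demonstration: 5 branches agree on clean audio; simulated noise
# independently flips a small fraction of bits in each branch.
rng = np.random.default_rng(0)
clean = rng.integers(0, 2, size=(1, 10, 8), dtype=np.uint8)  # 10 frames, 8-bit codes
branches = np.repeat(clean, 5, axis=0)
flips = (rng.random(branches.shape) < 0.1).astype(np.uint8)
noisy = branches ^ flips

voted = bitwise_vote(noisy)
# With independent flips, the per-bit majority typically recovers the
# clean code, which is the intuition behind the consensus mechanism.
```

Because a single corrupted branch cannot change a bit on which the other branches agree, the voted sequence is far less sensitive to perturbations than any single-path quantizer.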