StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
September 26, 2025
Authors: Yuhan Song, Linhao Zhang, Chuhan Wu, Aiwei Liu, Wei Jia, Houfeng Wang, Xiao Zhou
cs.AI
Abstract
Prevalent semantic speech tokenizers, designed to capture linguistic content,
are surprisingly fragile. We find they are not robust to meaning-irrelevant
acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech
is perfectly intelligible, their output token sequences can change drastically,
increasing the learning burden for downstream LLMs. This instability stems from
two flaws: a brittle single-path quantization architecture and a distant
training signal indifferent to intermediate token stability. To address this,
we introduce StableToken, a tokenizer that achieves stability through a
consensus-driven mechanism. Its multi-branch architecture processes audio in
parallel, and these representations are merged via a powerful bit-wise voting
mechanism to form a single, stable token sequence. StableToken sets a new
state-of-the-art in token stability, drastically reducing Unit Edit Distance
(UED) under diverse noise conditions. This foundational stability translates
directly to downstream benefits, significantly improving the robustness of
SpeechLLMs on a variety of tasks.
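The abstract describes merging parallel branch outputs via bit-wise voting to form one stable token sequence. A minimal sketch of that consensus idea, assuming each branch emits a k-bit binary code per frame (the branch outputs below are random stand-ins, not the actual model; the paper's real architecture and voting details may differ):

```python
import numpy as np

def bitwise_vote(branch_codes: np.ndarray) -> np.ndarray:
    """branch_codes: (num_branches, num_frames, num_bits) array of 0/1 bits.

    Returns a (num_frames, num_bits) array where each bit is the
    majority value across branches (ties impossible for odd counts).
    """
    num_branches = branch_codes.shape[0]
    return (branch_codes.sum(axis=0) * 2 > num_branches).astype(np.uint8)

# Toy demonstration: 5 branches agree on clean audio; simulated noise
# independently flips a small fraction of bits in each branch.
rng = np.random.default_rng(0)
clean = rng.integers(0, 2, size=(1, 10, 8), dtype=np.uint8)  # 10 frames, 8-bit codes
branches = np.repeat(clean, 5, axis=0)
flips = (rng.random(branches.shape) < 0.1).astype(np.uint8)
noisy = branches ^ flips

voted = bitwise_vote(noisy)
# With independent flips, the per-bit majority typically recovers the
# clean code, which is the intuition behind the consensus mechanism.
```

Because a single corrupted branch cannot change a bit on which the other branches agree, the voted sequence is far less sensitive to perturbations than any single-path quantizer.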