StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Abstract

Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find that they are not robust to meaning-irrelevant acoustic perturbations: even at high signal-to-noise ratios (SNRs), where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal that is indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and the resulting representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state of the art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly into downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
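
The abstract names two ideas that a small example can make concrete: merging branch outputs by bit-wise voting, and measuring stability with Unit Edit Distance. The Python sketch below is illustrative only and is not the authors' implementation. It assumes that each branch emits one integer token per frame, that the merged token takes the majority value of each bit, that the number of branches is odd, and that UED is the Levenshtein distance between the clean and noisy token sequences divided by the clean length. The function names `bitwise_majority_vote` and `unit_edit_distance` are hypothetical.

```python
from typing import Sequence


def bitwise_majority_vote(branch_tokens: Sequence[int], num_bits: int) -> int:
    """Merge one token index per branch into a single index.

    Assumption: each bit of the merged token is the majority value of that
    bit across branches (an odd branch count avoids ties).
    """
    if len(branch_tokens) % 2 == 0:
        raise ValueError("use an odd number of branches to avoid tied bits")
    merged = 0
    for bit in range(num_bits):
        # Count how many branches have this bit set.
        ones = sum((tok >> bit) & 1 for tok in branch_tokens)
        if 2 * ones > len(branch_tokens):
            merged |= 1 << bit
    return merged


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """Levenshtein distance between token sequences, divided by len(clean).

    Assumption: normalizing by the clean sequence length; the abstract does
    not state the normalization.
    """
    if not clean:
        raise ValueError("clean sequence must be non-empty")
    prev = list(range(len(noisy) + 1))  # distances from the empty prefix
    for i, c in enumerate(clean, start=1):
        curr = [i]
        for j, n in enumerate(noisy, start=1):
            cost = 0 if c == n else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(clean)


# One branch flips a single bit; the vote recovers the majority code.
merged = bitwise_majority_vote([0b1011, 0b1001, 0b1011], num_bits=4)

# One substituted token out of four.
ued = unit_edit_distance([1, 2, 3, 4], [1, 2, 9, 4])
```

Under these assumptions, a perturbation that flips one bit in a single branch is outvoted and leaves the merged token unchanged. A single-path quantizer has no such redundancy, which is the kind of brittleness the abstract attributes to it.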
