StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

Abstract

Prevalent semantic speech tokenizers, designed to capture linguistic content, are surprisingly fragile. We find they are not robust to meaning-irrelevant acoustic perturbations; even at high Signal-to-Noise Ratios (SNRs) where speech is perfectly intelligible, their output token sequences can change drastically, increasing the learning burden for downstream LLMs. This instability stems from two flaws: a brittle single-path quantization architecture and a distant training signal indifferent to intermediate token stability. To address this, we introduce StableToken, a tokenizer that achieves stability through a consensus-driven mechanism. Its multi-branch architecture processes audio in parallel, and these representations are merged via a powerful bit-wise voting mechanism to form a single, stable token sequence. StableToken sets a new state of the art in token stability, drastically reducing Unit Edit Distance (UED) under diverse noise conditions. This foundational stability translates directly to downstream benefits, significantly improving the robustness of SpeechLLMs on a variety of tasks.
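
To make the two central ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes each branch emits one integer token per frame, that tokens are merged by majority vote on each bit across an odd number of branches, and that UED is the Levenshtein distance between the clean and noisy token sequences divided by the length of the clean sequence. The abstract does not spell out these details, and the function names are hypothetical.

```python
from typing import List, Sequence


def bitwise_majority_vote(branch_tokens: Sequence[Sequence[int]], num_bits: int) -> List[int]:
    """Merge per-branch token sequences by majority vote on each bit (assumed scheme)."""
    num_branches = len(branch_tokens)
    if num_branches == 0 or num_branches % 2 == 0:
        raise ValueError("expected an odd, non-zero number of branches to avoid ties")
    length = len(branch_tokens[0])
    if any(len(seq) != length for seq in branch_tokens):
        raise ValueError("all branches must emit the same number of tokens")

    merged = []
    for t in range(length):
        token = 0
        for bit in range(num_bits):
            # Count how many branches set this bit at frame t.
            ones = sum((seq[t] >> bit) & 1 for seq in branch_tokens)
            if 2 * ones > num_branches:
                token |= 1 << bit
        merged.append(token)
    return merged


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Classic dynamic-programming edit distance over token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def unit_edit_distance(clean: Sequence[int], noisy: Sequence[int]) -> float:
    """Edit distance normalized by the clean sequence length (assumed UED definition)."""
    if len(clean) == 0:
        raise ValueError("clean sequence must be non-empty")
    return levenshtein(clean, noisy) / len(clean)


# Three hypothetical branches, two frames, 4-bit tokens.
branches = [[0b1010, 0b0111],
            [0b1000, 0b0111],
            [0b1110, 0b0011]]
merged = bitwise_majority_vote(branches, num_bits=4)
ued = unit_edit_distance([1, 2, 3, 4], [1, 2, 4, 4])
```

In this toy example a single flipped bit in one branch is outvoted by the other two, which is the intuition behind consensus-driven stability: a perturbation has to corrupt the same bit in a majority of branches before the merged token changes.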
