Interleaved Speech Language Models Operate Latently in Text

Abstract

Speech language models (SLMs) have been studied extensively, and the common paradigm incorporates text data and pre-trained text LMs. A leading approach is speech-text interleaving, in which models are trained on sequences containing both speech and text tokens, with the aim of improving even speech-only capabilities. Yet how these two modalities interact in the model's latent space remains unclear. In this work, we analyze interleaved speech-text LMs from different model families and sizes using the logit lens to provide such insight. We reveal that these models go through an implicit transcription phase in which the text token of the spoken word becomes decodable in intermediate layers, even though the models are not trained for speech recognition. The transcription of the word appears among the top candidate words for up to 77% of the data. Following this stage, the models predict the next word in the text space before transforming back to the speech domain. Finally, we analyze the role of interleaved data and of initialization from text LMs in eliciting this behavior, and how the behavior correlates with spoken knowledge abilities. Our analysis sheds light on the internal mechanisms underlying the relationship between the speech and text modalities and could shape SLM optimization.
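
The logit-lens check described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes that a per-layer hidden state and an unembedding matrix are already available as NumPy arrays, it omits the final normalization layer that logit-lens analyses typically apply before unembedding, and the function names `logit_lens_topk` and `layers_where_decodable` are hypothetical. The toy values are made up purely to show the mechanics.

```python
import numpy as np


def logit_lens_topk(hidden_states, unembed, k=5):
    """Project each layer's hidden state onto the vocabulary and return top-k ids.

    hidden_states: array of shape (num_layers, d_model), one vector per layer
                   at a fixed sequence position (assumed input format).
    unembed:       array of shape (d_model, vocab_size), the output embedding.
    """
    logits = hidden_states @ unembed  # shape: (num_layers, vocab_size)
    # Sort token ids by descending logit and keep the first k per layer.
    return np.argsort(-logits, axis=-1)[:, :k]


def layers_where_decodable(hidden_states, unembed, token_id, k=5):
    """Return the layers at which token_id is among the top-k decoded tokens."""
    topk = logit_lens_topk(hidden_states, unembed, k)
    return [layer for layer, ids in enumerate(topk) if token_id in ids]


# Toy setup (hypothetical values): 3 layers, d_model = 4, vocabulary of 4 tokens.
unembed = np.eye(4)
hidden = np.array([
    [1.0, 0.0, 0.0, 0.0],  # layer 0 points at token 0
    [0.0, 0.0, 3.0, 0.0],  # layer 1 points at token 2
    [0.0, 5.0, 0.0, 0.0],  # layer 2 points at token 1
])

# Token 2 plays the role of the "transcription" token in this toy example.
print(layers_where_decodable(hidden, unembed, token_id=2, k=1))  # → [1]
```

In this toy case the target token is decodable only at the middle layer, which mirrors the qualitative pattern the abstract describes: a text token surfacing in intermediate layers and being replaced by a different prediction later. Aggregating such a top-k hit over many examples would give a rate comparable in kind to the 77% figure, although the exact metric used in the paper is not specified here.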