Scaling Analysis of Interleaved Speech-Text Language Models

April 3, 2025
Authors: Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi
cs.AI

Abstract

Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question - do interleaved SLMs scale more efficiently than textless SLMs? In this paper we answer with a resounding yes! We conduct a scaling analysis of interleaved SLMs by training several dozen models and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics are significantly different from those of textless SLMs, suggesting one should allocate notably more of the compute budget to increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled-up model achieves performance comparable to leading models on speech semantic metrics while using less compute and data than other approaches. We open-source models, samples, and data - https://pages.cs.huji.ac.il/adiyoss-lab/sims.
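
As a rough illustration of the compute-budget trade-off the abstract refers to (model size versus training tokens), the sketch below works through a Chinchilla-style parametric loss L(N, D) = E + A*N^-alpha + B*D^-beta under the common C ~ 6*N*D approximation. It is a minimal, hypothetical example: the constants, exponents, and the function name compute_optimal_split are ours, not values or code from the paper.

```python
# A minimal, hypothetical sketch (not the paper's code) of Chinchilla-style
# compute-optimal allocation, the kind of trade-off the abstract alludes to
# when it says more of the budget should go to model size than to tokens.
# The constants A, B and the exponents alpha, beta are placeholders, not
# values fitted by the authors.

def compute_optimal_split(C, A, B, alpha, beta):
    """Minimise L(N, D) = E + A*N**-alpha + B*D**-beta subject to C = 6*N*D.

    Returns the loss-minimising parameter count N and token count D for a
    compute budget C (in FLOPs); the irreducible term E drops out.
    """
    # Substituting D = C / (6*N) and setting dL/dN = 0 gives the closed form
    #   N_opt = (alpha*A / (beta*B))**(1/(alpha+beta)) * (C/6)**(beta/(alpha+beta))
    g = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    n_opt = g * (C / 6.0) ** (beta / (alpha + beta))
    d_opt = C / (6.0 * n_opt)
    return n_opt, d_opt

if __name__ == "__main__":
    # Hypothetical fit: beta > alpha makes the optimal parameter count grow
    # faster with compute than the optimal token count, i.e. "spend more of
    # the budget on model size", as the abstract suggests for interleaved SLMs.
    A, B, alpha, beta = 400.0, 2000.0, 0.30, 0.38
    for C in (1e20, 1e21, 1e22):
        n, d = compute_optimal_split(C, A, B, alpha, beta)
        print(f"C={C:.0e} FLOPs -> N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

With exponents like these, the optimal parameter count scales as C^(beta/(alpha+beta)) while the token count scales as C^(alpha/(alpha+beta)), so a larger beta shifts the optimum toward bigger models trained on fewer tokens.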
