Scaling Analysis of Interleaved Speech-Text Language Models

Abstract

Existing scaling analyses of Speech Language Models (SLMs) paint a bleak picture. They predict that SLMs require much more compute and data than text language models, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained text language models (TextLMs) and use speech-text interleaving to allow knowledge transfer. This raises the question: do interleaved SLMs scale more efficiently than textless SLMs? In this paper we answer with a resounding yes! We conduct a scaling analysis of interleaved SLMs by training several dozen models and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics differ significantly from those of textless SLMs, suggesting that one should allocate notably more of the compute budget to increasing model size than to training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled-up model achieves performance comparable to leading models on speech semantic metrics while using less compute and data than other approaches. We open source models, samples, and data: https://pages.cs.huji.ac.il/adiyoss-lab/sims.
