From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Abstract

Large language models (LLMs) have shown remarkable performance and generalization capabilities across multiple languages and tasks, making them attractive targets for multi-modality integration (e.g., images or speech). In this work, we extend an existing LLM to the speech modality via speech discretization and continued pre-training. In particular, we are interested in multilingual LLMs, such as TOWER, as their pre-training setting allows us to treat discretized speech input as an additional translation language. The resulting open-source model, SPIRE, is able to transcribe and translate English speech input while maintaining TOWER's original performance on translation-related tasks, showing that integrating discretized speech input as an additional language is feasible during LLM adaptation. We make our code and models available to the community.
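To make the idea of treating discretized speech as an additional language more concrete, the following is a minimal sketch of one common way to discretize speech: assign each feature frame to its nearest cluster centroid, collapse consecutive repeats, and render the resulting unit IDs as pseudo-text tokens. The abstract does not specify any of these details, so the feature vectors, the centroids, the deduplication step, and the `<speech_N>` token naming are all assumptions made for illustration and should not be read as SPIRE's actual pipeline.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame`
    (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
    )


def discretize(frames: Sequence[Sequence[float]],
               centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to a discrete unit ID, then collapse consecutive
    repeats (the deduplication step is an assumption, not from the source)."""
    units = [nearest_centroid(f, centroids) for f in frames]
    return [u for i, u in enumerate(units) if i == 0 or u != units[i - 1]]


def units_to_tokens(units: Sequence[int],
                    template: str = "<speech_{}>") -> List[str]:
    """Render unit IDs as pseudo-text tokens. The token format is
    hypothetical; a real system would add such tokens to the LLM vocabulary."""
    return [template.format(u) for u in units]


# Toy data: three 2-D centroids and six feature frames (all made up).
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2],
          [4.8, 5.1], [5.2, 4.9], [0.0, 0.1]]

units = discretize(frames, centroids)
tokens = units_to_tokens(units)
print(units)             # [0, 1, 2, 0]
print(" ".join(tokens))  # <speech_0> <speech_1> <speech_2> <speech_0>
```

Once speech is represented as a token sequence like this, it can be placed in the same kind of source/target training examples that a translation-oriented LLM already sees for text, which is the intuition behind treating it as one more language during continued pre-training.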
