WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words

Abstract

Training on multiple input modalities can augment the capabilities of a language model. Here, we ask whether such a training regime can also improve the quality and efficiency of these systems. We focus on text-audio and introduce WhisBERT, which is inspired by the text-image approach of FLAVA (Singh et al., 2022). In accordance with the BabyLM (Warstadt et al., 2023) guidelines, we pretrain WhisBERT on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model trained on text only and on both audio and text simultaneously. We find that while WhisBERT performs well on multimodal masked modeling and surpasses the BabyLM baselines on most benchmark tasks, it struggles to optimize its complex objective and to outperform its text-only WhisBERT baseline.
