WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words
December 5, 2023
Authors: Lukas Wolf, Klemen Kotar, Greta Tuckute, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt
cs.AI
Abstract
Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can also improve the quality and efficiency of these systems. We focus on text-audio and introduce WhisBERT, which is inspired by the text-image approach of FLAVA (Singh et al., 2022). In accordance with the BabyLM guidelines (Warstadt et al., 2023), we pretrain WhisBERT on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model trained on text only and on both audio and text simultaneously. We find that while WhisBERT performs well on multimodal masked modeling and surpasses the BabyLM baselines on most benchmark tasks, it struggles to optimize its complex objective and to outperform its text-only WhisBERT baseline.
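The abstract does not spell out how the multimodal masked-modeling objective is composed. The sketch below is a minimal, hypothetical illustration of a FLAVA-style training loss that combines masked language modeling on text, masked reconstruction of audio frames, and a contrastive text-audio alignment term. The module sizes, the L1 reconstruction loss, the temperature, and the equal loss weighting are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a FLAVA-style multimodal masked-modeling objective.
# All architecture choices and loss weights here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalMaskedModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_audio_feats=80):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.audio_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.tok_embed = nn.Embedding(vocab_size, hidden)
        self.audio_proj = nn.Linear(n_audio_feats, hidden)
        self.mlm_head = nn.Linear(hidden, vocab_size)     # predict masked text tokens
        self.mam_head = nn.Linear(hidden, n_audio_feats)  # reconstruct masked audio frames

    def forward(self, token_ids, audio_frames):
        t = self.text_encoder(self.tok_embed(token_ids))       # (B, L, H)
        a = self.audio_encoder(self.audio_proj(audio_frames))  # (B, T, H)
        return t, a


def multimodal_loss(model, token_ids, labels, audio_frames, audio_targets, audio_mask):
    """labels: -100 at unmasked text positions; audio_mask: True where frames were masked."""
    t, a = model(token_ids, audio_frames)
    # Masked language modeling over text tokens.
    mlm = F.cross_entropy(
        model.mlm_head(t).flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    # Masked audio modeling: reconstruct masked spectrogram frames (L1 is an assumption).
    mam = F.l1_loss(model.mam_head(a)[audio_mask], audio_targets[audio_mask])
    # Contrastive alignment between mean-pooled text and audio representations.
    zt = F.normalize(t.mean(dim=1), dim=-1)
    za = F.normalize(a.mean(dim=1), dim=-1)
    logits = zt @ za.T / 0.07
    tgt = torch.arange(zt.size(0), device=zt.device)
    contrastive = (F.cross_entropy(logits, tgt) + F.cross_entropy(logits.T, tgt)) / 2
    # Equal weighting of the three terms is an illustrative choice.
    return mlm + mam + contrastive
```

The text-only baseline mentioned in the abstract would correspond to training with only the masked language modeling term, which makes plain why the full objective is harder to optimize: the model must balance reconstruction and alignment terms that can pull the shared representations in different directions.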