WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words
December 5, 2023
Authors: Lukas Wolf, Klemen Kotar, Greta Tuckute, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt
cs.AI
Abstract
Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can also improve the quality and efficiency of these systems. We focus on text-audio and introduce WhisBERT, which is inspired by the text-image approach of FLAVA (Singh et al., 2022). In accordance with the BabyLM guidelines (Warstadt et al., 2023), we pretrain WhisBERT on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model trained on text only and on both audio and text simultaneously. We find that while WhisBERT performs well on multimodal masked modeling and surpasses the BabyLM baselines on most benchmark tasks, it struggles to optimize its complex objective and to outperform its text-only WhisBERT baseline.
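The abstract does not spell out how the multimodal masked-modeling objective is composed. The sketch below is a minimal, hypothetical illustration of a FLAVA-style training loss that combines masked language modeling on text, masked reconstruction of audio frames, and a contrastive text-audio alignment term. The module sizes, the L1 reconstruction loss, the temperature, and the equal loss weighting are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a FLAVA-style multimodal masked-modeling objective.
# All architecture choices and loss weights here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalMaskedModel(nn.Module):
    def __init__(self, vocab_size=30522, hidden=256, n_audio_feats=80):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.audio_encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.tok_embed = nn.Embedding(vocab_size, hidden)
        self.audio_proj = nn.Linear(n_audio_feats, hidden)
        self.mlm_head = nn.Linear(hidden, vocab_size)     # predict masked text tokens
        self.mam_head = nn.Linear(hidden, n_audio_feats)  # reconstruct masked audio frames

    def forward(self, token_ids, audio_frames):
        t = self.text_encoder(self.tok_embed(token_ids))       # (B, L, H)
        a = self.audio_encoder(self.audio_proj(audio_frames))  # (B, T, H)
        return t, a


def multimodal_loss(model, token_ids, labels, audio_frames, audio_targets, audio_mask):
    """labels: -100 at unmasked text positions; audio_mask: True where frames were masked."""
    t, a = model(token_ids, audio_frames)
    # Masked language modeling over text tokens.
    mlm = F.cross_entropy(
        model.mlm_head(t).flatten(0, 1), labels.flatten(), ignore_index=-100
    )
    # Masked audio modeling: reconstruct masked spectrogram frames (L1 is an assumption).
    mam = F.l1_loss(model.mam_head(a)[audio_mask], audio_targets[audio_mask])
    # Contrastive alignment between mean-pooled text and audio representations.
    zt = F.normalize(t.mean(dim=1), dim=-1)
    za = F.normalize(a.mean(dim=1), dim=-1)
    logits = zt @ za.T / 0.07
    tgt = torch.arange(zt.size(0), device=zt.device)
    contrastive = (F.cross_entropy(logits, tgt) + F.cross_entropy(logits.T, tgt)) / 2
    # Equal weighting of the three terms is an illustrative choice.
    return mlm + mam + contrastive
```

The text-only baseline mentioned in the abstract would correspond to training with only the masked language modeling term, which makes plain why the full objective is harder to optimize: the model must balance reconstruction and alignment terms that can pull the shared representations in different directions.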