WhisBERT: Multimodal Text-Audio Language Modeling on 100M Words
December 5, 2023
Authors: Lukas Wolf, Klemen Kotar, Greta Tuckute, Eghbal Hosseini, Tamar Regev, Ethan Wilcox, Alex Warstadt
cs.AI
Abstract
Training on multiple modalities of input can augment the capabilities of a language model. Here, we ask whether such a training regime can improve the quality and efficiency of these systems as well. We focus on text-audio and introduce WhisBERT, which is inspired by the text-image approach of FLAVA (Singh et al., 2022). In accordance with the BabyLM guidelines (Warstadt et al., 2023), we pretrain WhisBERT on a dataset comprising only 100 million words plus their corresponding speech from the word-aligned version of the People's Speech dataset (Galvez et al., 2021). To assess the impact of multimodality, we compare versions of the model trained on text only and on both audio and text simultaneously. We find that while WhisBERT performs well on multimodal masked modeling and surpasses the BabyLM baselines on most benchmark tasks, it struggles to optimize its complex objective and does not outperform its text-only counterpart.
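As a rough illustration of the FLAVA-style setup described above, the sketch below pairs a text encoder and an audio encoder with a shared fusion encoder and reconstructs masked positions in both modalities. It is a minimal, hypothetical sketch: the module names, dimensions, feature choices, and losses are assumptions for illustration and are not taken from the WhisBERT implementation.

import torch
import torch.nn as nn

class TinyTextAudioMaskedModel(nn.Module):
    """Toy FLAVA-style model: unimodal encoders feed a shared multimodal encoder (illustrative only)."""
    def __init__(self, vocab_size=30522, d_model=256, n_heads=4, n_layers=2, audio_dim=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)  # e.g. log-mel frames -> model dim
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, n_layers)
        self.audio_encoder = nn.TransformerEncoder(layer, n_layers)
        self.fusion_encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)   # predict masked text tokens
        self.mam_head = nn.Linear(d_model, audio_dim)    # reconstruct masked audio frames

    def forward(self, token_ids, audio_frames):
        t = self.text_encoder(self.text_embed(token_ids))      # (B, T_text, d_model)
        a = self.audio_encoder(self.audio_proj(audio_frames))  # (B, T_audio, d_model)
        fused = self.fusion_encoder(torch.cat([t, a], dim=1))  # joint text+audio sequence
        t_out = fused[:, :token_ids.size(1)]
        a_out = fused[:, token_ids.size(1):]
        return self.mlm_head(t_out), self.mam_head(a_out)

# Usage sketch: replace masked tokens with a [MASK] id and zero out masked audio
# frames, then apply cross-entropy on masked token positions and an L2
# reconstruction loss on masked frame positions.
model = TinyTextAudioMaskedModel()
tokens = torch.randint(0, 30522, (2, 16))   # batch of token ids
frames = torch.randn(2, 50, 80)             # batch of 80-dim audio features
text_logits, frame_recon = model(tokens, frames)

A text-only baseline of the kind compared in the paper can be simulated by running only the text branch and its masked-token loss, which is one way to isolate the contribution of the audio modality.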