
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus

June 13, 2024
Authors: Matthieu Futeral, Armel Zebaze, Pedro Ortiz Suarez, Julien Abadji, Rémi Lacroix, Cordelia Schmid, Rachel Bawden, Benoît Sagot
cs.AI

Abstract

Multimodal Large Language Models (mLLMs) are trained on large amounts of text-image data. While most mLLMs are trained on caption-like data only, Alayrac et al. [2022] showed that additionally training them on interleaved sequences of text and images can lead to the emergence of in-context learning capabilities. However, the dataset they used, M3W, is not public and is only in English. There have been attempts to reproduce their results, but the released datasets are English-only. In contrast, current multilingual and multimodal datasets are either composed only of caption-like data, or are medium-scale or fully private. This limits mLLM research for the 7,000 other languages spoken in the world. We therefore introduce mOSCAR, to the best of our knowledge the first large-scale multilingual and multimodal document corpus crawled from the web. It covers 163 languages, 315M documents, 214B tokens and 1.2B images. We carefully conduct a set of filtering and evaluation steps to make sure mOSCAR is sufficiently safe, diverse and of good quality. We additionally train two types of multilingual models to demonstrate the benefits of mOSCAR: (1) a model trained on a subset of mOSCAR plus captioning data and (2) a model trained on captioning data only. The model additionally trained on mOSCAR shows a strong boost in few-shot learning performance across various multilingual image-text tasks and benchmarks, confirming previous findings for English-only mLLMs.
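To make the notion of an interleaved text-image document concrete, below is a minimal sketch of how such documents from a web-scale corpus like mOSCAR could be streamed and iterated with the Hugging Face datasets library. The repository path, config name, and field names ("nodes", "type", "content", "url") are assumptions for illustration only, not the authors' released loader or the actual mOSCAR schema.

```
# Minimal sketch, assuming a Hugging Face-hosted corpus of interleaved
# text-image documents. Field names and config are hypothetical.
from datasets import load_dataset

def iter_interleaved(repo="oscar-corpus/mOSCAR", config="eng_Latn", split="train"):
    # Stream to avoid downloading the full multi-terabyte corpus.
    ds = load_dataset(repo, config, split=split, streaming=True)
    for doc in ds:
        # Each document is assumed to interleave text and image nodes
        # in reading order, mirroring the source web page layout.
        for node in doc.get("nodes", []):
            if node.get("type") == "text":
                yield ("text", node.get("content"))
            elif node.get("type") == "image":
                yield ("image", node.get("url"))

# Example usage: inspect the first few nodes of the stream.
for i, (kind, payload) in enumerate(iter_interleaved()):
    print(kind, str(payload)[:80])
    if i >= 5:
        break
```

Streaming iteration like this is the typical way interleaved documents are fed to an mLLM training pipeline, where text spans are tokenized and image URLs are resolved to pixels on the fly.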
