Explore the Limits of Omni-modal Pretraining at Scale

Abstract

We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, during pretraining. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which we evaluate on the following tasks: i) single-modality perception benchmarks covering 10 different modalities, ii) 25 cross-modality understanding tasks spanning retrieval, question answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new state-of-the-art records. We hope that this research contributes to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo
