COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Abstract

In the evolution of vision-language pre-training, moving from short-text comprehension to extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E leverage the long-context capability of large language models and excel at few-shot text generation tasks, but they face challenges in alignment tasks. To address this gap, we introduce a contrastive loss into text generation models and present the COntrastive-Streamlined MultimOdal framework (COSMO), which strategically partitions the language model into a dedicated unimodal text-processing component and a multimodal data-handling component. COSMO unifies unimodal and multimodal elements, improving performance on tasks involving textual and visual data while notably reducing the number of learnable parameters. These models, however, demand extensive long-text datasets, and high-quality long-text video datasets remain scarce. To bridge this second gap, this work introduces \VideoDatasetName, the first interleaved video-text dataset with comprehensive captions. To demonstrate its impact, we show how \VideoDatasetName improves model performance on image-text tasks. With 34% of the learnable parameters and 72% of the available data, our model significantly outperforms OpenFlamingo. For instance, on the 4-shot Flickr captioning task, performance improves from 57.2% to 65.%. The contributions of COSMO and \VideoDatasetName are underscored by notable performance gains across 14 diverse downstream datasets spanning both image-text and video-text tasks.
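The idea of adding a contrastive loss to a text generation objective can be illustrated with a small sketch. The abstract does not give the exact form of either loss, so everything below is an assumption made for illustration: a CLIP-style symmetric InfoNCE loss over paired image and text embeddings, a mean negative log-likelihood for language modeling, and the hypothetical names `contrastive_loss`, `language_modeling_loss`, `total_loss`, `temperature`, and `contrastive_weight`. It is not the authors' implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image is assumed to match the i-th text."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def language_modeling_loss(token_log_probs):
    """Mean negative log-likelihood of the target tokens."""
    return -sum(token_log_probs) / len(token_log_probs)


def total_loss(image_embs, text_embs, token_log_probs, contrastive_weight=1.0):
    """Generation loss plus a weighted contrastive term (weighting is assumed)."""
    return (language_modeling_loss(token_log_probs)
            + contrastive_weight * contrastive_loss(image_embs, text_embs))


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.0], [0.0, 1.0]]
    shuffled_texts = matched_texts[::-1]
    log_probs = [math.log(0.5), math.log(0.5)]
    print("matched pairs :", total_loss(images, matched_texts, log_probs))
    print("shuffled pairs:", total_loss(images, shuffled_texts, log_probs))
```

In this toy setup the contrastive term is small when each image embedding lines up with its own text embedding and large when the pairs are shuffled, which is the alignment signal that a purely generative objective lacks.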

