
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

January 1, 2024
Authors: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
cs.AI

Abstract

In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce a contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (\ModelName), which strategically partitions the language model into dedicated unimodal text processing and adept multimodal data handling components. \ModelName, our unified framework, merges unimodal and multimodal elements, enhancing model performance on tasks involving textual and visual data while notably reducing the number of learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces \VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how \VideoDatasetName also enhances model performance in image-text tasks. With 34% of the learnable parameters and utilizing 72% of the available data, our model demonstrates significant superiority over OpenFlamingo. For instance, on the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65%. The contributions of \ModelName and \VideoDatasetName are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.
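The abstract describes adding a contrastive alignment objective to an autoregressive text generation model. As a rough illustration only, the sketch below shows one common way such a joint objective can be written in PyTorch: a CLIP-style symmetric InfoNCE term over image-text embeddings plus a standard next-token cross-entropy term. The function name, loss weighting, and tensor shapes are hypothetical assumptions for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F


def joint_contrastive_lm_loss(image_emb, text_emb, lm_logits, target_ids,
                              temperature=0.07, alpha=0.5, pad_id=-100):
    """Hypothetical combined objective: contrastive alignment + autoregressive LM loss.

    image_emb:  (B, D) pooled visual embeddings
    text_emb:   (B, D) pooled text embeddings
    lm_logits:  (B, T, V) next-token logits from the multimodal decoder
    target_ids: (B, T) token targets; pad_id marks positions to ignore
    """
    # Contrastive (alignment) term: symmetric InfoNCE over the batch,
    # matching each image to its paired text and vice versa.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = image_emb @ text_emb.t() / temperature              # (B, B)
    labels = torch.arange(sim.size(0), device=sim.device)
    contrastive = (F.cross_entropy(sim, labels) +
                   F.cross_entropy(sim.t(), labels)) / 2

    # Generative term: standard next-token cross-entropy for captioning.
    generative = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id,
    )

    # Weighted sum of the alignment and generation objectives.
    return alpha * contrastive + (1 - alpha) * generative
```

This only sketches how a contrastive term can coexist with a generation loss; how the paper partitions the language model layers between unimodal and multimodal processing is described in the full text, not here.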