
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

January 1, 2024
作者: Alex Jinpeng Wang, Linjie Li, Kevin Qinghong Lin, Jianfeng Wang, Kevin Lin, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou
cs.AI

Abstract

In the evolution of Vision-Language Pre-training, shifting from short-text comprehension to encompassing extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, leveraging the long-context capability of Large Language Models, have excelled in few-shot text generation tasks but face challenges in alignment tasks. Addressing this gap, we introduce a contrastive loss into text generation models, presenting the COntrastive-Streamlined MultimOdal framework (COSMO), which strategically partitions the language model into a dedicated unimodal text-processing component and an adept multimodal data-handling component. COSMO, our unified framework, merges unimodal and multimodal elements, enhancing model performance on tasks involving textual and visual data while notably reducing the number of learnable parameters. However, these models demand extensive long-text datasets, yet the availability of high-quality long-text video datasets remains limited. To bridge this gap, this work introduces Howto-Interlink7M, an inaugural interleaved video-text dataset featuring comprehensive captions, marking a significant step forward. Demonstrating its impact, we illustrate how Howto-Interlink7M enhances model performance in image-text tasks. With 34% of the learnable parameters and 72% of the available data, our model demonstrates significant superiority over OpenFlamingo. For instance, on the 4-shot Flickr captioning task, performance notably improves from 57.2% to 65%. The contributions of COSMO and Howto-Interlink7M are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.
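
The abstract describes pairing a contrastive alignment objective with the usual autoregressive generation objective inside a single language model. The sketch below is only a minimal illustration of that general idea under assumed interfaces; the method names (encode_image, encode_text, forward_lm) and the 0.5 loss weighting are hypothetical, and the sketch does not reproduce the paper's actual implementation or its partitioning of language-model layers.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def training_step(model, batch, alpha=0.5):
    """One step combining an alignment (contrastive) loss and a generation (LM) loss.

    Assumed, hypothetical model interface:
      - model.encode_image / model.encode_text -> pooled (B, D) embeddings
      - model.forward_lm -> next-token logits (B, T, V) over the interleaved sequence
    """
    img_emb = model.encode_image(batch["images"])
    txt_emb = model.encode_text(batch["captions"])
    align = contrastive_loss(img_emb, txt_emb)

    logits = model.forward_lm(batch["input_ids"], batch["images"])
    gen = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),   # labels: input_ids with pad/image slots set to -100
        ignore_index=-100,
    )
    return alpha * align + (1 - alpha) * gen  # alpha is an illustrative weighting, not from the paper
```

In this kind of setup, the contrastive term supplies the image-text alignment signal that purely autoregressive models lack, while the language-modeling term preserves few-shot text generation ability.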