COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Abstract

In the evolution of vision-language pre-training, the shift from short-text comprehension to extended textual contexts is pivotal. Recent autoregressive vision-language models such as Flamingo and PaLM-E, which leverage the long-context capability of large language models, excel at few-shot text generation tasks but struggle with alignment tasks. To address this gap, we introduce a contrastive loss into text generation models and present the COntrastive-Streamlined MultimOdal framework (COSMO), which strategically partitions the language model into dedicated unimodal text processing and multimodal data handling components. COSMO, our unified framework, merges unimodal and multimodal elements, improving model performance on tasks involving textual and visual data while notably reducing the number of learnable parameters. However, these models demand extensive long-text datasets, and high-quality long-text video datasets remain scarce. To bridge this gap, this work introduces \VideoDatasetName, the first interleaved video-text dataset featuring comprehensive captions. To demonstrate its impact, we show how \VideoDatasetName enhances model performance on image-text tasks. With 34% of the learnable parameters and 72% of the available data, our model significantly outperforms OpenFlamingo. For instance, on the 4-shot Flickr captioning task, performance improves from 57.2% to 65.\%. The contributions of COSMO and \VideoDatasetName are underscored by notable performance gains across 14 diverse downstream datasets spanning both image-text and video-text tasks.
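The idea of adding a contrastive loss to a text generation model can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a symmetric InfoNCE loss over paired image and text embeddings, a standard next-token cross-entropy for generation, and a plain weighted sum of the two. The function names, the temperature of 0.07, and the `contrastive_weight` parameter are all hypothetical choices made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of L2-normalised embedding pairs.

    Pair i (image_emb[i], text_emb[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature value
    is an assumption, not a setting taken from the paper.
    """
    n = len(image_emb)
    # sim[i][j] = <image_i, text_j> / temperature
    sim = [[sum(a * b for a, b in zip(img, txt)) / temperature
            for txt in text_emb] for img in image_emb]
    # Image-to-text direction: row i should select column i.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should select row j.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


def language_modeling_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for a single sequence."""
    losses = [-log_softmax(logits)[t]
              for logits, t in zip(token_logits, target_ids)]
    return sum(losses) / len(losses)


def total_loss(image_emb, text_emb, token_logits, target_ids,
               contrastive_weight=1.0):
    """Weighted sum of both objectives (the weighting is an assumption)."""
    return (language_modeling_loss(token_logits, target_ids)
            + contrastive_weight * contrastive_loss(image_emb, text_emb))
```

In this sketch the contrastive term rewards matching image-text pairs for being closer than mismatched ones, which targets the alignment tasks, while the language modeling term keeps the generative objective unchanged.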
