Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
September 5, 2023
Authors: Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
cs.AI
Abstract
We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented,
token-based, decoder-only multi-modal language model capable of generating and
infilling both text and images. CM3Leon uses the CM3 multi-modal architecture
but additionally shows the extreme benefits of scaling up and tuning on more
diverse instruction-style data. It is the first multi-modal model trained with
a recipe adapted from text-only language models, including a large-scale
retrieval-augmented pre-training stage and a second multi-task supervised
fine-tuning (SFT) stage. It is also a general-purpose model that can do both
text-to-image and image-to-text generation, allowing us to introduce
self-contained contrastive decoding methods that produce high-quality outputs.
Extensive experiments demonstrate that this recipe is highly effective for
multi-modal models. CM3Leon achieves state-of-the-art performance in
text-to-image generation with 5x less training compute than comparable methods
(zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate
unprecedented levels of controllability in tasks ranging from language-guided
image editing to image-controlled generation and segmentation.
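
The abstract mentions contrastive decoding without giving details. As a rough illustration only, the sketch below shows one common way to contrast two sets of next-token logits from the same token-based model: one computed with the prompt, one without it. This is an assumed, generic formulation, not necessarily the method used in the paper; the function names, the `alpha` parameter, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(cond_logits, uncond_logits, alpha):
    """Mix conditional and unconditional logits (hypothetical helper).

    alpha = 1.0 returns the conditional logits unchanged;
    alpha > 1.0 pushes the result further away from the unconditional ones.
    """
    return [u + alpha * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
cond = [2.0, 1.0, 0.0]    # model conditioned on the prompt
uncond = [1.0, 1.0, 1.0]  # same model with the prompt masked out

guided = contrastive_logits(cond, uncond, alpha=3.0)
probs = softmax(guided)
```

With `alpha` above 1, tokens that the prompt makes more likely gain probability mass relative to plain conditional sampling, which is the intuition behind this family of decoding methods.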