Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Abstract

We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
