
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

September 5, 2023
作者: Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Richard James, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
cs.AI

Abstract

We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation with 5x less training compute than comparable methods (zero-shot MS-COCO FID of 4.88). After SFT, CM3Leon can also demonstrate unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
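The abstract highlights self-contained contrastive decoding as the mechanism that lets a single text-to-image and image-to-text model improve its own outputs. The sketch below is a minimal, hedged illustration of the general contrastive-decoding idea only, not CM3Leon's exact procedure: it contrasts conditional and unconditional next-token distributions and picks the most prompt-specific plausible token. The toy `cond_logits`/`uncond_logits` functions, the vocabulary size, and the `alpha` plausibility cutoff are all illustrative assumptions standing in for a real model.

```python
# Hedged sketch of contrastive decoding for a decoder-only model.
# NOT CM3Leon's exact method; toy logit functions stand in for real model calls.
import numpy as np

VOCAB = 16  # toy vocabulary of text/image tokens (assumption)

def cond_logits(prefix):
    """Stand-in for model logits conditioned on the full prompt plus generated prefix."""
    rng = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    return rng.normal(size=VOCAB)

def uncond_logits(prefix):
    """Stand-in for logits with the conditioning (e.g. the text prompt) dropped."""
    rng = np.random.default_rng((hash(tuple(prefix)) + 1) % (2**32))
    return rng.normal(size=VOCAB)

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def contrastive_step(prefix, alpha=0.1):
    p_cond = softmax(cond_logits(prefix))
    p_uncond = softmax(uncond_logits(prefix))
    # Plausibility constraint: keep only tokens the conditional model itself rates highly.
    keep = p_cond >= alpha * p_cond.max()
    # Score = log-ratio of conditional to unconditional probability;
    # higher means the token is driven by the prompt rather than by generic priors.
    scores = np.where(keep, np.log(p_cond) - np.log(p_uncond), -np.inf)
    return int(scores.argmax())

prefix = [0]  # start-of-sequence token id (assumption)
for _ in range(8):
    prefix.append(contrastive_step(prefix))
print("sampled token ids:", prefix)
```

In a real setting the two logit functions would be two forward passes of the same decoder-only model, with and without the conditioning tokens, which is what makes the method "self-contained" in the sense the abstract describes.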