Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning

Abstract

We present CM3Leon (pronounced "Chameleon"), a retrieval-augmented, token-based, decoder-only multi-modal language model capable of generating and infilling both text and images. CM3Leon uses the CM3 multi-modal architecture but additionally shows the extreme benefits of scaling up and tuning on more diverse instruction-style data. It is the first multi-modal model trained with a recipe adapted from text-only language models, including a large-scale retrieval-augmented pre-training stage and a second multi-task supervised fine-tuning (SFT) stage. It is also a general-purpose model that can do both text-to-image and image-to-text generation, allowing us to introduce self-contained contrastive decoding methods that produce high-quality outputs. Extensive experiments demonstrate that this recipe is highly effective for multi-modal models. CM3Leon achieves state-of-the-art performance in text-to-image generation (zero-shot MS-COCO FID of 4.88) with 5x less training compute than comparable methods. After SFT, CM3Leon also demonstrates unprecedented levels of controllability in tasks ranging from language-guided image editing to image-controlled generation and segmentation.
