Jointly Training Large Autoregressive Multimodal Models

Abstract

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy tailored for mixed-modal generation tasks. Our final instruction-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
