Jointly Training Large Autoregressive Multimodal Models
September 27, 2023
Authors: Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, Barlas Oguz
cs.AI
Abstract
In recent years, advances in the large-scale pretraining of language and
text-to-image models have revolutionized the field of machine learning. Yet,
integrating these two modalities into a single, robust model capable of
generating seamless multimodal outputs remains a significant challenge. To
address this gap, we present the Joint Autoregressive Mixture (JAM) framework,
a modular approach that systematically fuses existing text and image generation
models. We also introduce a specialized, data-efficient instruction-tuning
strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned
model demonstrates unparalleled performance in generating high-quality
multimodal outputs and represents the first model explicitly designed for this
purpose.
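
The abstract does not specify how the two pretrained backbones are fused. As a purely illustrative aid, the sketch below shows one simple way to combine two same-architecture autoregressive decoders: linear interpolation of their parameters. The names text_model, image_model, and average_parameters are hypothetical stand-ins, and this is an assumption for intuition, not the paper's actual fusion procedure.

    # Minimal sketch (assumption, not the paper's exact method): fuse two
    # same-architecture autoregressive models by averaging their weights.
    import copy
    import torch.nn as nn

    def average_parameters(text_model: nn.Module,
                           image_model: nn.Module,
                           alpha: float = 0.5) -> dict:
        """Return a state dict interpolating two models' parameters.

        Assumes both models share the same architecture and parameter names
        (hypothetical stand-ins for pretrained text and image decoders).
        """
        text_state = text_model.state_dict()
        image_state = image_model.state_dict()
        fused_state = {}
        for name, t_param in text_state.items():
            i_param = image_state[name]
            if t_param.is_floating_point():
                # alpha = 0.5 gives a plain average of the two checkpoints.
                fused_state[name] = alpha * t_param + (1.0 - alpha) * i_param
            else:
                # Non-float buffers (e.g., integer counters) are copied as-is.
                fused_state[name] = t_param.clone()
        return fused_state

    # Usage sketch: load the fused weights into a copy of either backbone.
    # fused_model = copy.deepcopy(text_model)
    # fused_model.load_state_dict(average_parameters(text_model, image_model))

A single interpolation coefficient is only the most basic option; any practical fusion of separately pretrained text and image models would also have to reconcile their tokenizers and output spaces, which this sketch deliberately leaves out.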