Jointly Training Large Autoregressive Multimodal Models

Abstract

In recent years, advances in the large-scale pretraining of language and text-to-image models have revolutionized the field of machine learning. Yet, integrating these two modalities into a single, robust model capable of generating seamless multimodal outputs remains a significant challenge. To address this gap, we present the Joint Autoregressive Mixture (JAM) framework, a modular approach that systematically fuses existing text and image generation models. We also introduce a specialized, data-efficient instruction-tuning strategy, tailored for mixed-modal generation tasks. Our final instruct-tuned model demonstrates unparalleled performance in generating high-quality multimodal outputs and represents the first model explicitly designed for this purpose.
