Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and on high-quality text-image paired data, which is scarce. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively on abundant unlabeled image-only data, removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, improving instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only approximately 1050 H800 GPU hours, of which the vast majority (1000 hours) went to the efficient image-only pre-training stage. It achieves 0.89 on GenEval and 0.55 on WISE, matching or exceeding strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.
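The two-stage data schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `paired_fraction` mixing ratio, and the representation of a sample as an `(image_id, caption)` tuple are all assumptions made for illustration, and the abstract does not state the actual mixing ratio or the training objective.

```python
import random
from typing import Iterator, List, Optional, Tuple

# Hypothetical sample type: (image_id, caption); caption is None for unlabeled images.
Sample = Tuple[str, Optional[str]]


def stage1_batches(images: List[str], batch_size: int, steps: int,
                   seed: int = 0) -> Iterator[List[Sample]]:
    """Stage 1 (assumed form): every batch contains image-only samples."""
    rng = random.Random(seed)
    for _ in range(steps):
        yield [(rng.choice(images), None) for _ in range(batch_size)]


def stage2_batches(images: List[str], pairs: List[Sample], batch_size: int,
                   steps: int, paired_fraction: float = 0.5,
                   seed: int = 0) -> Iterator[List[Sample]]:
    """Stage 2 (assumed form): mix unlabeled images with curated text-image pairs.

    paired_fraction is a hypothetical knob: the probability that a given
    sample in the batch is drawn from the curated paired set.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        batch: List[Sample] = []
        for _ in range(batch_size):
            if rng.random() < paired_fraction:
                batch.append(rng.choice(pairs))           # captioned sample
            else:
                batch.append((rng.choice(images), None))  # image-only sample
        yield batch


if __name__ == "__main__":
    unlabeled = [f"img_{i}" for i in range(100)]
    curated = [(f"pair_{i}", f"caption {i}") for i in range(10)]
    for batch in stage2_batches(unlabeled, curated, batch_size=4, steps=2):
        print(batch)
```

In this sketch the paired set can stay small because it is only sampled during the second stage, which mirrors the abstract's point that the costly pre-training phase needs no captions.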
