Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
March 17, 2026
Authors: Peng Sun, Jun Xie, Tao Lin
cs.AI
Abstract
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and on high-quality text-image paired data that is scarce. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks.
To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework.
The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
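The abstract does not spell out the first-stage objective beyond the masked modeling named in the title, so the following is only a minimal illustrative sketch of a masked-modeling step on image patches. All names (`masked_modeling_step`), the mask ratio, and the zero-vector mask token are assumptions for illustration; the identity "predictor" stands in for the actual visual generative component.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_modeling_step(patches, mask_ratio=0.6):
    """One illustrative masked-modeling step on image-only data.

    `patches`: (num_patches, dim) array of patch features extracted
    from an unlabeled image. A random subset of patches is masked,
    a predictor reconstructs them, and the loss is computed only on
    the masked positions -- no text supervision is needed.
    """
    n, _ = patches.shape
    num_masked = int(n * mask_ratio)
    masked_idx = rng.choice(n, size=num_masked, replace=False)

    # Replace masked patches with a mask token (here simply zeros).
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0

    # Stand-in for the generative component: a real model would be a
    # transformer mapping corrupted patches to reconstructions.
    predictions = corrupted

    # Reconstruction loss over masked positions only (mean squared error).
    loss = np.mean((predictions[masked_idx] - patches[masked_idx]) ** 2)
    return loss, masked_idx

patches = rng.normal(size=(16, 8))
loss, masked_idx = masked_modeling_step(patches)
```

Because the loss depends only on the image itself, this stage can consume arbitrarily large unlabeled image corpora, which is what removes the paired-data dependency from the costly phase.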
Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance.
For example, our IOMM-B (3.6B) model was trained from scratch using only ~1,050 H800 GPU hours (with the vast majority, 1,000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
Code is available at https://github.com/LINs-lab/IOMM.