

Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

March 17, 2026
Authors: Peng Sun, Jun Xie, Tao Lin
cs.AI

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and on scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively on abundant unlabeled image-only data, removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, improving instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.
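The two-stage recipe described above can be sketched at a toy level. The snippet below illustrates the general shape of masked-modeling pre-training on unlabeled images (stage 1) followed by a mixed data batch (stage 2); the mask ratio, the mean-squared reconstruction loss, and the 9:1 data-mix ratio are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(n_patches, mask_ratio, rng):
    """Randomly split patch indices into visible and masked sets
    (the standard masked-image-modeling setup)."""
    n_masked = int(n_patches * mask_ratio)
    idx = rng.permutation(n_patches)
    return idx[n_masked:], idx[:n_masked]  # visible, masked

def reconstruction_loss(pred, target):
    """MSE computed on masked patches only, as in masked autoencoding."""
    return float(np.mean((pred - target) ** 2))

# Toy "image": 16 patch tokens of dimension 4 (no caption needed).
patches = rng.normal(size=(16, 4))
visible, masked = mask_patches(len(patches), mask_ratio=0.75, rng=rng)

# Stage 1: image-only pre-training step — predict masked patches.
pred = np.zeros_like(patches[masked])  # stand-in for the model's output
loss_stage1 = reconstruction_loss(pred, patches[masked])

# Stage 2: fine-tune on a mix of unlabeled images and a small
# curated set of text-image pairs (illustrative 9:1 ratio).
batch = ["image_only"] * 9 + ["text_image_pair"] * 1
```

The key property the sketch highlights is that stage 1 never touches text: the objective is defined entirely by the image itself, which is what lets the costly phase run on unlabeled data alone.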