Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training
March 17, 2026
Authors: Peng Sun, Jun Xie, Tao Lin
cs.AI
Abstract
Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and on high-quality text-image paired data that is scarce. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks.
To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework.
The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality.
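The abstract does not spell out the first-stage objective beyond the masked modeling named in the title, so the following is only a minimal illustrative sketch of a masked-modeling step on image patches. All names (`masked_modeling_step`), the mask ratio, and the zero-vector mask token are assumptions for illustration; the identity "predictor" stands in for the actual visual generative component.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_modeling_step(patches, mask_ratio=0.6):
    """One illustrative masked-modeling step on image-only data.

    `patches`: (num_patches, dim) array of patch features extracted
    from an unlabeled image. A random subset of patches is masked,
    a predictor reconstructs them, and the loss is computed only on
    the masked positions -- no text supervision is needed.
    """
    n, _ = patches.shape
    num_masked = int(n * mask_ratio)
    masked_idx = rng.choice(n, size=num_masked, replace=False)

    # Replace masked patches with a mask token (here simply zeros).
    corrupted = patches.copy()
    corrupted[masked_idx] = 0.0

    # Stand-in for the generative component: a real model would be a
    # transformer mapping corrupted patches to reconstructions.
    predictions = corrupted

    # Reconstruction loss over masked positions only (mean squared error).
    loss = np.mean((predictions[masked_idx] - patches[masked_idx]) ** 2)
    return loss, masked_idx

patches = rng.normal(size=(16, 8))
loss, masked_idx = masked_modeling_step(patches)
```

Because the loss depends only on the image itself, this stage can consume arbitrarily large unlabeled image corpora, which is what removes the paired-data dependency from the costly phase.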
Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance.
For example, our IOMM-B (3.6B) model was trained from scratch using only ~1,050 H800 GPU hours (with the vast majority, 1,000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50).
Code is available at https://github.com/LINs-lab/IOMM.