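Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and on scarce high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose **Image-Only Training for UMMs (IOMM)**, a data-efficient two-stage training framework. The first stage pre-trains the visual generative component using only abundant unlabeled image-only data, removing the dependency on paired data for this costly phase. The second stage fine-tunes the model on a mixture of unlabeled images and a small curated set of text-image pairs, which improves instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only about 1050 H800 GPU hours, of which the vast majority (1000 hours) went to the efficient image-only pre-training stage. It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 and 0.55) and BLIP3-o-4B (0.84 and 0.50). Code is available at https://github.com/LINs-lab/IOMM.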
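The two-stage data regime described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(image_id, caption)` sample layout, and the `paired_fraction` mixing parameter are all hypothetical, and the abstract does not state what mixing ratio is actually used. The sketch only shows that stage 1 never sees captions, while stage 2 draws from both unlabeled images and curated pairs.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple

# Hypothetical sample layout: (image_id, caption), with caption=None for unlabeled images.
Sample = Tuple[str, Optional[str]]


def stage1_batches(images: Sequence[str], batch_size: int,
                   rng: random.Random) -> Iterator[List[Sample]]:
    """Stage 1 (image-only pre-training): every sample is an unlabeled image."""
    while True:
        yield [(rng.choice(images), None) for _ in range(batch_size)]


def stage2_batches(images: Sequence[str], pairs: Sequence[Sample], batch_size: int,
                   paired_fraction: float, rng: random.Random) -> Iterator[List[Sample]]:
    """Stage 2 (fine-tuning): mix unlabeled images with curated text-image pairs.

    paired_fraction is an assumed knob: the probability that a given slot in the
    batch is filled from the paired set rather than the unlabeled set.
    """
    if not 0.0 <= paired_fraction <= 1.0:
        raise ValueError("paired_fraction must lie in [0, 1]")
    while True:
        batch: List[Sample] = []
        for _ in range(batch_size):
            if rng.random() < paired_fraction:
                batch.append(rng.choice(pairs))          # curated text-image pair
            else:
                batch.append((rng.choice(images), None))  # unlabeled image
        yield batch
```

A training loop would consume `stage1_batches` for the bulk of the compute budget and switch to `stage2_batches` for the short fine-tuning phase; how the model uses a sample whose caption is `None` is left outside this sketch.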

