DREAM: Where Visual Understanding Meets Text-to-Image Generation

Abstract

Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, Semantically Aligned Decoding aligns partially masked image candidates with the target text and selects the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (a 6.2% improvement over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, enabling unified multimodal models that excel at both visual understanding and generation.
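The two techniques named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not give the shape of the Masking Warmup schedule or the scoring function used in Semantically Aligned Decoding, so the linear ramp, the cosine-similarity score, and every name below (`masking_ratio`, `select_best_candidate`, `warmup_steps`) are assumptions made purely for illustration.

```python
import math


def masking_ratio(step, warmup_steps, min_ratio=0.0, max_ratio=1.0):
    """Hypothetical Masking Warmup: ramp the mask ratio from min to max.

    Assumption: a linear ramp over `warmup_steps`; the abstract only says
    that masking starts minimal and gradually transitions to full masking.
    """
    if warmup_steps <= 0:
        return max_ratio
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return min_ratio + progress * (max_ratio - min_ratio)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_best_candidate(candidate_embeddings, text_embedding):
    """Hypothetical selection step for Semantically Aligned Decoding.

    Assumption: each partially decoded image candidate has an embedding in
    the same space as the text embedding, and the candidate with the highest
    cosine similarity is kept for further decoding.
    """
    scores = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


if __name__ == "__main__":
    # Mask ratio at the start, middle, and end of a 100-step warmup.
    print([masking_ratio(s, 100) for s in (0, 50, 100)])
    # Pick the candidate whose toy embedding best matches the toy text embedding.
    print(select_best_candidate([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 1.0])[0])
```

In this sketch the selection reuses the model's own image-text alignment rather than a separate scorer, which mirrors the abstract's point that no external reranker is required.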
