DREAM: Where Visual Understanding Meets Text-to-Image Generation

Abstract

Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, Semantically Aligned Decoding aligns partially masked image candidates with the target text and selects the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (a 6.2% improvement over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, enabling unified multimodal models that excel at both visual understanding and generation.
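
The two techniques named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the shape of the masking schedule or the scoring function used to rank candidates, so the linear ramp and the cosine-similarity score below are assumptions, and the names `warmup_mask_ratio`, `cosine_similarity`, and `select_best_candidate` are hypothetical rather than taken from the DREAM implementation.

```python
import math
from typing import Sequence


def warmup_mask_ratio(step: int, warmup_steps: int,
                      start_ratio: float = 0.0, end_ratio: float = 1.0) -> float:
    """Masking ratio at a training step.

    Assumption: a linear ramp from start_ratio (minimal masking) to
    end_ratio (full masking). The actual DREAM schedule may differ.
    """
    if warmup_steps <= 0:
        return end_ratio
    # Clamp progress to [0, 1] so the ratio stays at end_ratio after warmup.
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return start_ratio + progress * (end_ratio - start_ratio)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_best_candidate(candidate_embeddings: Sequence[Sequence[float]],
                          text_embedding: Sequence[float]) -> int:
    """Index of the candidate whose embedding best matches the text embedding.

    Assumption: each partially masked image candidate has already been
    embedded by the model itself, and alignment is scored by cosine similarity.
    """
    scores = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Usage: ratio halfway through a 100-step warmup, then pick among 3 candidates.
ratio = warmup_mask_ratio(step=50, warmup_steps=100)
best = select_best_candidate([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]], [0.0, 1.0])
```

In this sketch the selected candidate would then be passed on for further decoding. Because the scores come from the model's own embeddings, no external reranker is involved, which matches the abstract's description.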
