DREAM: Where Visual Understanding Meets Text-to-Image Generation

Abstract

Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (a 6.2% improvement over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models to excel at both visual understanding and generation.
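
The two techniques named in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the shape of the masking schedule or how candidates are scored, so everything concrete here is an assumption made for illustration: the linear ramp, the `start` and `end` ratios, the use of cosine similarity, and the function names `masking_ratio` and `select_best_candidate` are hypothetical and not taken from DREAM itself. Embeddings are represented as plain lists of floats.

```python
import math
from typing import Sequence


def masking_ratio(step: int, warmup_steps: int,
                  start: float = 0.0, end: float = 1.0) -> float:
    """Mask ratio at a training step under an ASSUMED linear warmup.

    Starts at `start` (minimal masking) and ramps to `end` (full masking)
    over `warmup_steps`; the linear shape is an illustrative choice.
    """
    if warmup_steps <= 0:
        return end
    # Clamp progress to [0, 1] so the ratio stays at `end` after warmup.
    progress = min(max(step / warmup_steps, 0.0), 1.0)
    return start + (end - start) * progress


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if degenerate)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0


def select_best_candidate(candidate_embeddings: Sequence[Sequence[float]],
                          text_embedding: Sequence[float]) -> int:
    """Index of the candidate whose embedding best matches the text.

    ASSUMPTION: each partially decoded candidate has already been embedded,
    and alignment is measured with cosine similarity.
    """
    if not candidate_embeddings:
        raise ValueError("at least one candidate is required")
    scores = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, a training loop would call `masking_ratio(step, warmup_steps)` once per step to decide how much of the image to mask. At inference, `select_best_candidate` would pick which partially masked candidate to continue decoding. Because both embeddings would come from the same model, no external reranker is involved.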
