

DREAM: Where Visual Understanding Meets Text-to-Image Generation

March 3, 2026
Authors: Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra
cs.AI

Abstract

Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding, which aligns partially masked image candidates with the target text and selects the best one for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (a 6.2% improvement over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, enabling unified multimodal models that excel at both visual understanding and generation.
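The Masking Warmup schedule can be pictured as a single function mapping the training step to a mask ratio. The sketch below is a minimal illustration, not the paper's implementation: the starting ratio, the warmup fraction, and the cosine ramp shape are all assumptions, since the abstract only specifies a progression from minimal to full masking.

```python
import math

def masking_warmup_ratio(step: int, total_steps: int,
                         start_ratio: float = 0.1,
                         warmup_frac: float = 0.5) -> float:
    """Progressive masking schedule: low masking early, so contrastive
    text-image alignment can form, ramping to full masking for stable
    generative training.

    start_ratio, warmup_frac, and the cosine shape are illustrative
    assumptions; the abstract only states a minimal-to-full progression.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step >= warmup_steps:
        return 1.0  # full masking for the rest of training
    # Cosine ramp from start_ratio at step 0 up to 1.0 at the end of warmup.
    progress = step / max(warmup_steps, 1)
    return start_ratio + (1.0 - start_ratio) * 0.5 * (1.0 - math.cos(math.pi * progress))
```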
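Semantically Aligned Decoding amounts to a branch-and-select step during generation: draw several partially decoded candidates, score each against the target text, and finish decoding only the best one. The sketch below assumes a hypothetical masked-decoder interface (partial_decode, image_embed, and finish_decode are placeholder names, not DREAM's actual API) and uses the model's own image-text similarity as the selection score, matching the abstract's claim that no external reranker is needed.

```python
import torch

@torch.no_grad()
def semantically_aligned_decoding(model, text_emb, num_candidates: int = 8,
                                  partial_steps: int = 4, total_steps: int = 16):
    """Sketch of Semantically Aligned Decoding under an assumed interface.

    1. Decode several candidates for only a few steps (still partially masked).
    2. Score each partial candidate against the target text embedding using
       the model's own image-text similarity (no external reranker).
    3. Keep the best-aligned candidate and decode it to completion.
    """
    # Step 1: draw partially decoded candidates (placeholder call).
    candidates = [model.partial_decode(text_emb, steps=partial_steps)
                  for _ in range(num_candidates)]
    # Step 2: cosine similarity between the text embedding (D,) and the
    # stacked partial-image embeddings (N, D) -> scores of shape (N,).
    img_embs = torch.stack([model.image_embed(c) for c in candidates])
    scores = torch.nn.functional.cosine_similarity(
        img_embs, text_emb.unsqueeze(0), dim=-1)
    best = candidates[int(scores.argmax())]
    # Step 3: finish decoding only the best-aligned candidate.
    return model.finish_decode(best, steps=total_steps - partial_steps)
```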