Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
September 11, 2025
Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
cs.AI
Abstract
In this paper, we introduce an insightful paradigm through the Auto-Encoder
lens: understanding as the encoder (I2T) that compresses images into text, and
generation as the decoder (T2I) that reconstructs images from that text. Using
reconstruction fidelity as the unified training objective, we enforce coherent
bidirectional information flow between the understanding and generation
processes, yielding mutual gains. To implement this, we propose UAE,
a novel framework for unified multimodal learning. We begin by pre-training the
decoder with large-scale long-context image captions to capture fine-grained
semantics and complex spatial relationships. We then propose Unified-GRPO, a
reinforcement learning (RL) scheme that covers three stages: (1) a cold-start phase
to gently initialize both encoder and decoder with a semantic reconstruction
loss; (2) Generation for Understanding, where the encoder is trained to
generate informative captions that maximize the decoder's reconstruction
quality, enhancing its visual understanding; (3) Understanding for Generation,
where the decoder is refined to reconstruct from these captions, forcing it to
leverage every detail and improving its long-context instruction following and
generation fidelity. For evaluation, we introduce Unified-Bench, the first
benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A
surprising "aha moment" arises within the multimodal learning domain: as RL
progresses, the encoder autonomously produces more descriptive captions, while
the decoder simultaneously demonstrates a profound ability to understand these
intricate descriptions, resulting in reconstructions of striking fidelity.
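
To make the Auto-Encoder framing concrete, below is a minimal, self-contained sketch of the image → text → image loop with reconstruction fidelity as the shared score. All names here (i2t_encoder, t2i_decoder, embed_image, reconstruction_fidelity) are hypothetical stand-ins for illustration, not the UAE models, and scoring fidelity by cosine similarity of image embeddings is an assumption about how semantic reconstruction quality could be measured.

```python
# Illustrative sketch of the Auto-Encoder view from the abstract:
# understanding as an I2T encoder, generation as a T2I decoder, with
# reconstruction fidelity as the shared objective. All functions below are
# placeholder stand-ins, not the authors' UAE implementation.
import numpy as np

def i2t_encoder(image: np.ndarray) -> str:
    """Stand-in for the understanding model: image -> long caption."""
    return "a placeholder caption describing the image in detail"

def t2i_decoder(caption: str, size=(64, 64, 3)) -> np.ndarray:
    """Stand-in for the generation model: caption -> reconstructed image."""
    rng = np.random.default_rng(abs(hash(caption)) % (2**32))
    return rng.random(size)

def embed_image(image: np.ndarray) -> np.ndarray:
    """Stand-in for a semantic image embedder (e.g., a CLIP-style encoder)."""
    return image.reshape(-1)[:512]

def reconstruction_fidelity(original: np.ndarray, reconstructed: np.ndarray) -> float:
    """Cosine similarity between semantic embeddings of the two images."""
    a, b = embed_image(original), embed_image(reconstructed)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# One pass of the image -> text -> image loop.
image = np.random.default_rng(0).random((64, 64, 3))
caption = i2t_encoder(image)            # "encode" the image into text
reconstruction = t2i_decoder(caption)   # "decode" the text back into an image
reward = reconstruction_fidelity(image, reconstruction)
print(f"reconstruction fidelity (reward signal): {reward:.3f}")
```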
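The Unified-GRPO stages reward the encoder (and later the decoder) by reconstruction quality. The sketch below shows only the group-relative advantage normalization that GRPO-style methods typically apply to such rewards; the helper name and the example numbers are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of using reconstruction fidelity as a GRPO-style reward,
# as in the "Generation for Understanding" stage: sample several candidate
# captions per image, score each by how well the decoder reconstructs from
# it, then standardize the rewards within the sampled group.
import numpy as np

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Suppose 4 candidate captions were sampled for one image and each was
# scored with a reconstruction-fidelity reward (as in the sketch above).
candidate_rewards = [0.41, 0.58, 0.52, 0.77]
advantages = group_relative_advantages(candidate_rewards)
print(advantages)  # captions yielding higher fidelity get positive advantage
```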