Can Understanding and Generation Truly Benefit Together -- or Just Coexist?
September 11, 2025
Authors: Zhiyuan Yan, Kaiqing Lin, Zongjian Li, Junyan Ye, Hui Han, Zhendong Wang, Hao Liu, Bin Lin, Hao Li, Xue Xu, Xinyan Xiao, Jingdong Wang, Haifeng Wang, Li Yuan
cs.AI
Abstract
In this paper, we introduce an insightful paradigm through the Auto-Encoder lens: understanding as the encoder (I2T) that compresses images into text, and generation as the decoder (T2I) that reconstructs images from that text. Using reconstruction fidelity as the unified training objective, we enforce a coherent bidirectional information flow between the understanding and generation processes, bringing mutual gains. To implement this, we propose UAE, a novel framework for unified multimodal learning. We begin by pre-training the decoder with large-scale long-context image captions to capture fine-grained semantics and complex spatial relationships. We then propose Unified-GRPO via reinforcement learning (RL), which covers three stages: (1) a cold-start phase that gently initializes both encoder and decoder with a semantic reconstruction loss; (2) Generation for Understanding, where the encoder is trained to generate informative captions that maximize the decoder's reconstruction quality, enhancing its visual understanding; (3) Understanding for Generation, where the decoder is refined to reconstruct from these captions, forcing it to leverage every detail and improving its long-context instruction following and generation fidelity. For evaluation, we introduce Unified-Bench, the first benchmark tailored to assess the degree of unification of unified multimodal models (UMMs). A surprising "aha moment" arises within the multimodal learning domain: as RL progresses, the encoder autonomously produces more descriptive captions, while the decoder simultaneously demonstrates a profound ability to understand these intricate descriptions, resulting in reconstructions of striking fidelity.
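To make the auto-encoder framing concrete, the following is a minimal Python sketch of the reconstruction-fidelity objective described above. All names (encode_i2t, decode_t2i, embed) are hypothetical placeholders rather than the authors' actual API, and the reward shown is a simple embedding cosine similarity standing in for whatever fidelity metric UAE uses in practice.

```python
# Minimal sketch, assuming hypothetical encode_i2t / decode_t2i interfaces
# and an embedding-based cosine similarity as the reconstruction reward.
import numpy as np

def embed(image: np.ndarray) -> np.ndarray:
    """Placeholder image embedder (e.g. a frozen vision encoder)."""
    return image.reshape(-1) / (np.linalg.norm(image) + 1e-8)

def encode_i2t(image: np.ndarray) -> str:
    """Understanding model (encoder): image -> long-context caption."""
    return "a placeholder caption describing the image in detail"

def decode_t2i(caption: str, shape=(32, 32, 3)) -> np.ndarray:
    """Generation model (decoder): caption -> reconstructed image."""
    rng = np.random.default_rng(abs(hash(caption)) % (2 ** 32))
    return rng.random(shape)

def reconstruction_reward(image: np.ndarray) -> float:
    """Unified objective: how faithfully the encoder's caption lets the
    decoder reconstruct the original image (cosine similarity here)."""
    caption = encode_i2t(image)
    recon = decode_t2i(caption, shape=image.shape)
    return float(embed(image) @ embed(recon))

if __name__ == "__main__":
    img = np.random.default_rng(0).random((32, 32, 3))
    # In stage (2) of Unified-GRPO this scalar would serve as the RL reward
    # for the encoder; in stage (3) it would drive updates to the decoder.
    print("reconstruction fidelity reward:", reconstruction_reward(img))
```

In this view, the same scalar couples the two directions: stage (2) holds the decoder fixed and rewards captions that reconstruct well, while stage (3) holds the captions fixed and refines the decoder to exploit every detail they contain.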