HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
February 17, 2025
Authors: Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
cs.AI
Abstract
The remarkable success of the autoregressive paradigm has made significant
advancement in Multimodal Large Language Models (MLLMs), with powerful models
like Show-o, Transfusion and Emu3 achieving notable progress in unified image
understanding and generation. For the first time, we uncover a common
phenomenon: the understanding capabilities of MLLMs are typically stronger than
their generative capabilities, with a significant gap between the two. Building
on this insight, we propose HermesFlow, a simple yet general framework designed
to seamlessly bridge the gap between understanding and generation in MLLMs.
Specifically, we take the homologous data as input to curate homologous
preference data of both understanding and generation. Through Pair-DPO and
self-play iterative optimization, HermesFlow effectively aligns multimodal
understanding and generation using homologous preference data. Extensive
experiments demonstrate the significant superiority of our approach over prior
methods, particularly in narrowing the gap between multimodal understanding and
generation. These findings highlight the potential of HermesFlow as a general
alignment framework for next-generation multimodal foundation models. Code:
https://github.com/Gen-Verse/HermesFlow
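Based on the abstract, Pair-DPO plausibly combines standard DPO preference losses over homologous understanding and generation pairs derived from the same input. The sketch below illustrates that idea only; the joint weighted form, the `lam` weight, and the function names are assumptions for illustration, not details taken from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where *_w is the preferred response and *_l the dispreferred one."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pair_dpo_loss(und_pair, gen_pair, beta=0.1, lam=0.5):
    """Hypothetical Pair-DPO objective: a weighted sum of DPO losses on
    homologous understanding and generation preference pairs built from
    the same input (each pair is a tuple of four log-probabilities)."""
    loss_und = dpo_loss(*und_pair, beta=beta)
    loss_gen = dpo_loss(*gen_pair, beta=beta)
    return lam * loss_und + (1.0 - lam) * loss_gen

# Example: identical policy/reference log-probs give a zero margin,
# so each DPO term reduces to -log(0.5) = log 2.
pair = (-1.0, -2.0, -1.0, -2.0)
print(pair_dpo_loss(pair, pair))
```

In the self-play iteration described in the abstract, a loss of this shape would be minimized repeatedly, with fresh homologous preference data curated by the model itself at each round.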