HermesFlow: Seamlessly Closing the Gap in Multimodal Understanding and Generation
February 17, 2025
Authors: Ling Yang, Xinchen Zhang, Ye Tian, Chenming Shang, Minghao Xu, Wentao Zhang, Bin Cui
cs.AI
Abstract
The remarkable success of the autoregressive paradigm has made significant
advancement in Multimodal Large Language Models (MLLMs), with powerful models
like Show-o, Transfusion and Emu3 achieving notable progress in unified image
understanding and generation. For the first time, we uncover a common
phenomenon: the understanding capabilities of MLLMs are typically stronger than
their generative capabilities, with a significant gap between the two. Building
on this insight, we propose HermesFlow, a simple yet general framework designed
to seamlessly bridge the gap between understanding and generation in MLLMs.
Specifically, we take the homologous data as input to curate homologous
preference data of both understanding and generation. Through Pair-DPO and
self-play iterative optimization, HermesFlow effectively aligns multimodal
understanding and generation using homologous preference data. Extensive
experiments demonstrate the significant superiority of our approach over prior
methods, particularly in narrowing the gap between multimodal understanding and
generation. These findings highlight the potential of HermesFlow as a general
alignment framework for next-generation multimodal foundation models. Code:
https://github.com/Gen-Verse/HermesFlow
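Based on the abstract, Pair-DPO plausibly combines standard DPO preference losses over homologous understanding and generation pairs derived from the same input. The sketch below illustrates that idea only; the joint weighted form, the `lam` weight, and the function names are assumptions for illustration, not details taken from the paper:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where *_w is the preferred response and *_l the dispreferred one."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pair_dpo_loss(und_pair, gen_pair, beta=0.1, lam=0.5):
    """Hypothetical Pair-DPO objective: a weighted sum of DPO losses on
    homologous understanding and generation preference pairs built from
    the same input (each pair is a tuple of four log-probabilities)."""
    loss_und = dpo_loss(*und_pair, beta=beta)
    loss_gen = dpo_loss(*gen_pair, beta=beta)
    return lam * loss_und + (1.0 - lam) * loss_gen

# Example: identical policy/reference log-probs give a zero margin,
# so each DPO term reduces to -log(0.5) = log 2.
pair = (-1.0, -2.0, -1.0, -2.0)
print(pair_dpo_loss(pair, pair))
```

In the self-play iteration described in the abstract, a loss of this shape would be minimized repeatedly, with fresh homologous preference data curated by the model itself at each round.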