
Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models

February 17, 2026
Authors: Sen Ye, Mengde Xu, Shuyang Gu, Di He, Liwei Wang, Han Hu
cs.AI

Abstract

Current research in multimodal models faces a key challenge: enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and find that its primary cause may be a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This algorithm reframes the single-step generation task as a multi-step "generate-understand-regenerate" process. By explicitly leveraging the model's understanding capability during generation, we mitigate the optimization dilemma, achieving stronger generation results and improving the understanding abilities related to the generation process. These findings offer valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
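As a rough illustration of the "generate-understand-regenerate" idea, the loop can be sketched as below. This is only a minimal sketch under stated assumptions, not the R3 implementation: the abstract does not specify the algorithm's details, so the function names, the Critique type, the max_rounds limit, and the toy string-based "model" are all hypothetical stand-ins. A real unified multimodal model would produce images and critique them with its own understanding capability.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    """Result of the understanding step (hypothetical structure)."""
    satisfied: bool
    feedback: str


def generate_understand_regenerate(
    prompt: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    understand: Callable[[str, str], Critique],
    max_rounds: int = 3,  # assumed cap on critique rounds, not from the paper
) -> Tuple[str, List[Tuple[str, Critique]]]:
    """Draft an output, critique it, and redraft until the critique passes."""
    output = generate(prompt, None, None)  # initial single-step draft
    history: List[Tuple[str, Critique]] = []
    for _ in range(max_rounds):
        critique = understand(prompt, output)  # understanding step
        history.append((output, critique))
        if critique.satisfied:
            break
        # Regeneration step, conditioned on the previous draft and feedback.
        output = generate(prompt, output, critique.feedback)
    return output, history


# Toy stand-ins (hypothetical): the "output" is a string that should
# contain every word of the prompt.
def toy_generate(prompt: str, previous: Optional[str], feedback: Optional[str]) -> str:
    if previous is None:
        # The first draft deliberately omits the last word of the prompt.
        return " ".join(prompt.split()[:-1])
    # A revision appends whatever the critique reported as missing.
    return f"{previous} {feedback}".strip()


def toy_understand(prompt: str, output: str) -> Critique:
    missing = [w for w in prompt.split() if w not in output.split()]
    return Critique(satisfied=not missing, feedback=" ".join(missing))


if __name__ == "__main__":
    final, history = generate_understand_regenerate(
        "a red cube", toy_generate, toy_understand
    )
    print(final)
    print(len(history))
```

In this toy setup the first draft omits a word, the understanding step reports it, and the regeneration step repairs the draft. The sketch only shows the control flow; it says nothing about how the paper trains or optimizes the model.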