Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
April 28, 2026
Authors: Jiayi Guo, Linqing Wang, Jiangshan Wang, Yang Yue, Zeyu Liu, Zhiyuan Zhao, Qinglin Lu, Gao Huang, Chunyu Wang
cs.AI
Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.
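The regenerate-instead-of-edit loop described above can be sketched in code. This is a minimal toy illustration of the control flow only: the `ToyUMM` class and its `generate`, `encode_semantic_tokens`, and `regenerate` methods are hypothetical stand-ins invented for this sketch, not the authors' actual model or API.

```python
# Toy control-flow sketch of Refinement via Regeneration (RvR).
# All class/method names here are illustrative assumptions, not the paper's API.

class ToyUMM:
    """Stand-in for a unified multimodal model (UMM)."""

    def generate(self, prompt):
        # Initial text-to-image generation (toy: return a dict for an "image").
        return {"prompt_seen": prompt, "aligned": False}

    def encode_semantic_tokens(self, image):
        # Map the initial image to discrete semantic tokens (toy: a list).
        return ["tok_subject", "tok_layout"]

    def regenerate(self, prompt, semantic_tokens):
        # Conditional regeneration: condition on the target prompt AND the
        # initial image's semantic tokens, rather than on an editing
        # instruction with strict pixel-level preservation (the RvE paradigm).
        return {"prompt_seen": prompt, "tokens": semantic_tokens, "aligned": True}


def refine_via_regeneration(model, prompt, rounds=1):
    # 1) generate an initial image, 2) extract its semantic tokens,
    # 3) regenerate conditioned on prompt + tokens; optionally iterate.
    image = model.generate(prompt)
    for _ in range(rounds):
        tokens = model.encode_semantic_tokens(image)
        image = model.regenerate(prompt, tokens)
    return image


result = refine_via_regeneration(ToyUMM(), "a red cube left of a blue ball")
```

The key design point the sketch encodes is that the conditioning signal is the target prompt plus semantic tokens of the initial image, which widens the modification space relative to edit instructions that must preserve aligned pixels.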